What "resume" usually means (and doesn't)
I've lost a 4 GB download to hotel Wi-Fi twice and decided that was enough. Most download tools claim to support resume. Pull the Wi-Fi at 2.1 GB of a 3 GB file and watch what actually happens. Some start over at byte zero. The 2.1 GB you already had is gone. Some keep the partial file, reconnect, and start appending from wherever the last write landed, which is usually a coin flip between "the byte after the last successful flush" and "somewhere inside a buffer that never made it to disk." Resume works for the demo and fails for the drive home.
Resume is an architectural decision you either make on day one or you don't. You can't sprinkle it on later. If the download model assumes one continuous stream, every resume story is a retrofit. If the model assumes many independent pieces, resume is what the thing already does.
Chunks are the atom of recovery
Anything above about 10 MB gets split into 1 MB chunks. Each
chunk is its own request with its own Range header,
its own retry budget, its own place in the completion map. Eight
chunks stream in parallel by default. The file you see on disk
is stitched from those chunks when they all land.
The Range header is what makes this work over HTTP.
A chunk isn't a Veloxar-internal concept the server has to know
about. It's just a byte range the server already knows how to
serve.
    GET /video.mp4 HTTP/1.1
    Host: cdn.example.com
    Range: bytes=2097152-3145727

    HTTP/1.1 206 Partial Content
    Content-Range: bytes 2097152-3145727/3221225472
    Content-Length: 1048576
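The arithmetic behind those ranges is small enough to sketch. This is an illustration in Python, not Veloxar's code: plan_chunks and range_header are names I made up for the example, and the only things taken from the text are the 1 MB chunk size and the inclusive byte offsets that Range expects.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size quoted above


def plan_chunks(total_size: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int, int]]:
    """Yield (index, first_byte, last_byte); both byte offsets are inclusive."""
    for index, first_byte in enumerate(range(0, total_size, chunk_size)):
        # The final chunk is shorter when the size isn't a multiple of chunk_size.
        last_byte = min(first_byte + chunk_size, total_size) - 1
        yield index, first_byte, last_byte


def range_header(first_byte: int, last_byte: int) -> str:
    """Format the value of an HTTP Range header for one chunk."""
    return f"bytes={first_byte}-{last_byte}"


chunks = list(plan_chunks(3_221_225_472))  # the 3 GB file from the exchange above
print(len(chunks))                          # 3072
print(range_header(*chunks[2][1:]))         # bytes=2097152-3145727
```

Chunk 2 reproduces the request shown above, and a 3 GB file comes out as 3,072 chunks numbered 0 through 3,071.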
When the network drops mid-file, Veloxar doesn't have to rebuild
"where we were" by counting bytes written to a stream. It
already knows. Chunks 0 through 2,149 completed, chunks 2,150
and 2,151 were mid-flight, chunks 2,152 through 3,071 never
started. On reconnect, it asks for exactly the ranges it's
missing. The 2.1 GB already on disk stays on disk.
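Here is roughly what "asks for exactly the ranges it's missing" looks like as code. It's a sketch under assumptions: the completion map is modelled as a set of finished chunk indexes plus a dict of bytes already written for each partial chunk, and missing_ranges is a hypothetical helper, not a Veloxar function.

```python
from typing import Dict, List, Set, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as in the text


def missing_ranges(
    total_size: int,
    done: Set[int],
    partial: Dict[int, int],
    chunk_size: int = CHUNK_SIZE,
) -> List[Tuple[int, int]]:
    """Return inclusive (first_byte, last_byte) ranges still to be fetched.

    done:    indexes of chunks that finished.
    partial: chunk index -> bytes of that chunk already on disk.
    """
    chunk_count = (total_size + chunk_size - 1) // chunk_size  # ceiling division
    ranges = []
    for index in range(chunk_count):
        if index in done:
            continue
        chunk_start = index * chunk_size
        last_byte = min(chunk_start + chunk_size, total_size) - 1
        # Skip whatever part of this chunk already landed.
        first_byte = chunk_start + partial.get(index, 0)
        if first_byte <= last_byte:
            ranges.append((first_byte, last_byte))
    return ranges
```

Nothing here counts bytes in a stream. The answer falls out of the map, which is why a reconnect doesn't have to guess.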
ChunkedDownloadManager is the file where this all
lives. The interesting parts are where individual chunks get
scheduled, cancelled, and retried without the surrounding file
knowing anything happened.
Saving state every two seconds, not every chunk
ChunkStateManager writes the completion map to
disk every 2 seconds. Not after every chunk, because that's too
often for large files where chunks finish in bursts. And not
at the end, because "at the end" is exactly when the process is
most likely to die. Two seconds is the compromise. The worst
case for a crash is that you repeat about two seconds of work.
The subtle thing is what gets saved. It's not "the last byte we wrote." It's the full set of which chunks are done, which are partial, and how far each partial one got. That's the difference between "resume from the start of the current chunk" (which some tools do) and "resume from inside the chunk we were working on." With 1 MB chunks the difference is small per-chunk, but multiplied across eight parallel streams on a slow connection, restarting all eight in-flight chunks is real wasted bandwidth.
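A minimal sketch of that save loop, in Python. The two-second interval and the shape of what gets saved come from the description above. Everything else is my assumption for illustration: the class name ChunkStateSketch, the JSON layout, and the write-to-temp-then-rename step, which I've added so a crash mid-save can't leave a half-written state file.

```python
import json
import os
import tempfile
import time

SAVE_INTERVAL_SECONDS = 2.0  # the interval from the text


class ChunkStateSketch:
    """Tracks done and partial chunks; persists at most once per interval."""

    def __init__(self, path, clock=time.monotonic):
        self.path = path
        self.done = set()   # indexes of finished chunks
        self.partial = {}   # chunk index -> bytes written so far
        self._clock = clock
        self._last_save = clock()

    def mark_progress(self, index, bytes_written):
        self.partial[index] = bytes_written

    def mark_done(self, index):
        self.partial.pop(index, None)
        self.done.add(index)

    def maybe_save(self):
        """Save only if the interval has elapsed; return True if it saved."""
        now = self._clock()
        if now - self._last_save < SAVE_INTERVAL_SECONDS:
            return False
        self.save()
        self._last_save = now
        return True

    def save(self):
        payload = {
            "done": sorted(self.done),
            "partial": {str(i): n for i, n in self.partial.items()},
        }
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename over the old file so readers see either old or new state.
        os.replace(tmp_path, self.path)
```

Call maybe_save from the download loop as often as you like. Chunks finishing in a burst cost one write, not one write each.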
The zero-byte file trap
This is the bug that ships in every download tool written in
a hurry. A previous attempt crashed before any bytes were
written. A zero-byte .part file is sitting on
disk. The recovery path sees "file exists" and decides to
resume. It sends Range: bytes=0-. The server
returns 200 OK, full body. The tool appends to the zero-byte
file. Sometimes that's fine. Sometimes the tool's resume logic
decides the existing .part file is authoritative,
truncates the response to "new data only," and you end up with
a file that's smaller than it should be. I've seen variants
where the tool gets stuck in a loop re-resuming zero bytes.
A .part file that's zero bytes is not a resume
candidate. It's garbage. DownloadRecoveryPolicy
treats zero-byte partials as "delete and start over," not
"resume from offset 0." This is the kind of one-line check
that separates download managers that quietly corrupt files
from ones that don't.
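The check really is about one line. A sketch of the decision, assuming a plain file on disk and using names I've invented for the example (PartialAction, classify_partial), not the actual DownloadRecoveryPolicy interface:

```python
import os
from enum import Enum


class PartialAction(Enum):
    START_FRESH = "start_fresh"                      # no .part file at all
    DELETE_AND_START_OVER = "delete_and_start_over"  # zero-byte .part file
    RESUME = "resume"                                # .part file with real bytes


def classify_partial(part_path: str) -> PartialAction:
    """Decide what to do with a leftover .part file before any request is sent."""
    try:
        size = os.path.getsize(part_path)
    except FileNotFoundError:
        return PartialAction.START_FRESH
    if size == 0:
        # "File exists" is not the same as "there is something to resume".
        return PartialAction.DELETE_AND_START_OVER
    return PartialAction.RESUME
```

The point of making the zero-byte case its own outcome is that nothing downstream ever gets to treat an empty file as authoritative.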
Hosts have a reputation
The other half of resume is what you remember between sessions. Most download managers treat every launch as a fresh start. If a host rate-limited you at 8 parallel connections yesterday, you'll happily open 8 parallel connections to it today and eat the 429 all over again.
HostRateLimitTracker persists per-host observations
across sessions. Each 429 steps the host's allowed concurrency
down: 8 to 4, 4 to 2, 2 to 1. A cooldown of 60 seconds gets
added per hit, stacking up to about 300 seconds. Successes
gradually restore the limit back upward. The first download to
a fragile host on a fresh launch starts at whatever that host
has earned, not at the default.
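To make the bookkeeping concrete, here is a small Python sketch of the step-down and cooldown rules. The halving steps, the 60-second increment, and the 300-second cap are the numbers above. The class and field names are hypothetical, the JSON format is my own, and the recovery rate (one extra connection per success) is purely an assumption, since all the text commits to is "gradually".

```python
import json
from dataclasses import asdict, dataclass

DEFAULT_CONCURRENCY = 8
COOLDOWN_STEP_SECONDS = 60
COOLDOWN_CAP_SECONDS = 300


@dataclass
class HostRecord:
    allowed: int = DEFAULT_CONCURRENCY
    cooldown_seconds: int = 0


class HostLimitsSketch:
    def __init__(self):
        self.hosts = {}  # host name -> HostRecord

    def allowed_for(self, host: str) -> int:
        """Concurrency a new download to this host should start with."""
        return self.hosts.get(host, HostRecord()).allowed

    def record_429(self, host: str) -> HostRecord:
        record = self.hosts.setdefault(host, HostRecord())
        record.allowed = max(1, record.allowed // 2)  # 8 -> 4 -> 2 -> 1
        record.cooldown_seconds = min(
            COOLDOWN_CAP_SECONDS, record.cooldown_seconds + COOLDOWN_STEP_SECONDS
        )
        return record

    def record_success(self, host: str) -> None:
        # Assumption: restore one connection per success, up to the default.
        record = self.hosts.get(host)
        if record is not None:
            record.allowed = min(DEFAULT_CONCURRENCY, record.allowed + 1)

    def to_json(self) -> str:
        return json.dumps({h: asdict(r) for h, r in self.hosts.items()})

    @classmethod
    def from_json(cls, text: str) -> "HostLimitsSketch":
        tracker = cls()
        tracker.hosts = {h: HostRecord(**f) for h, f in json.loads(text).items()}
        return tracker
```

Writing to_json on exit and reading from_json on launch is the whole "memory" part. A host that earned a limit of 2 yesterday starts at 2 today.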
This pairs with resume in a way that only matters when you trust both pieces. Resume gets you back to the byte where you left off. Host memory gets you back to the concurrency that host will actually honour, so you don't immediately trip the same rate limit that caused the disconnect in the first place. Without the memory, resume is a loop.
When resume isn't the right answer
The retry and recovery policy has a branch that's as important as the resume logic itself: the branch where it refuses to resume. If a chunk fails with a 404, the URL isn't coming back. If the decoder says the bytes so far are garbage, appending more bytes won't fix them. If the server is explicitly saying 403, retrying at any byte offset is just going to get you blocked harder.
DownloadRecoveryPolicy splits errors the same way
the retry classifier does. A network timeout, a lost connection,
a DNS failure, a 5xx: those are resumable. The partial file
stays, the state file stays, the next attempt picks up where
the last one stopped. A 4xx other than 429, a decode failure, a bad URL: those
aren't. The .part file and its state get cleaned
up together, so the next attempt doesn't try to resume from a
corrupt offset and your disk doesn't quietly fill with orphaned
partials.
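The split reads naturally as a small classifier. This is a sketch, not the real DownloadRecoveryPolicy: the enum names and the decide function are invented for the example. One judgment call is labelled in the code: I treat a 429 as resumable, because the host-tracking section handles rate limits by backing off, not by throwing the partial away. That is my reading, not something the text states about the policy.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TIMEOUT = "timeout"
    CONNECTION_LOST = "connection_lost"
    DNS_FAILURE = "dns_failure"
    HTTP_STATUS = "http_status"
    DECODE_FAILURE = "decode_failure"
    BAD_URL = "bad_url"


class Recovery(Enum):
    RESUME = "resume"    # keep the .part file and its state
    DISCARD = "discard"  # delete the .part file and its state together


TRANSIENT = {FailureKind.TIMEOUT, FailureKind.CONNECTION_LOST, FailureKind.DNS_FAILURE}


def decide(kind: FailureKind, status: Optional[int] = None) -> Recovery:
    """Map a failure to keep-and-resume or clean-up-and-stop."""
    if kind in TRANSIENT:
        return Recovery.RESUME
    if kind is FailureKind.HTTP_STATUS and status is not None:
        if 500 <= status <= 599:
            return Recovery.RESUME
        if status == 429:
            # Assumption: rate limiting is a "slow down", not a "go away".
            return Recovery.RESUME
    # Other 4xx, decode failures, bad URLs, and anything unrecognised.
    return Recovery.DISCARD
```

Defaulting to DISCARD for anything unrecognised is deliberate in this sketch. Resuming onto bytes you can't vouch for is the more expensive mistake.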
Chunk-level retry is independent of file-level retry. If one of eight parallel chunks fails with a recoverable error, only that chunk retries, with its own backoff, jitter, and budget. The other seven never notice. It's the same thundering-herd fix from the queue story, applied one level down. The atoms that retry are chunks, not whole files, and they're decorrelated from each other by design.
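For the per-chunk backoff, here is one common way to get that decorrelation: exponential backoff where each delay is drawn at random between zero and the current ceiling. I don't know which jitter scheme Veloxar actually uses, so treat the function name and every number here (five attempts, 0.5 s base, 30 s cap) as assumptions.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    max_attempts: int = 5,
    base_seconds: float = 0.5,
    cap_seconds: float = 30.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one randomised delay per retry attempt, then stop (budget spent)."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # The ceiling doubles each attempt until it hits the cap.
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Drawing from [0, ceiling] keeps parallel chunks from retrying in lockstep.
        yield rng.uniform(0.0, ceiling)


# Each chunk gets its own generator, so budgets and timing stay independent.
for delay in backoff_delays():
    pass  # in real use: sleep for `delay`, then retry this chunk's range
```

Because every chunk draws its own delays, eight chunks that fail at the same instant come back spread out instead of all at once.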
After the Wi-Fi drops
You start a 3 GB download on flaky hotel Wi-Fi. Two thirds of the way through, the access point drops for 40 seconds (my personal record is longer, and the coffee was terrible). When it comes back, Veloxar picks up within a second or two, streams the chunks it was missing, and finishes. No progress bar snapping back to zero. No "resuming" blinking for a minute while the tool reconstructs state.
A week later you come back to a host that rate-limited you on the last session. The first download opens fewer connections than usual, because the app remembers. It completes faster than it would have if you'd walked into the same 429 wall twice. You don't see any of that either, which is the whole point. Resume and host memory are features you only notice when they aren't there.