One pipeline, not two | Veloxar Blog

We used to have two code paths for adding URLs -- one for the clipboard watcher, one for everywhere else. They drifted. Here's the refactor that collapsed them into a single extraction step, and the duplicate-row bug that forced us to do it.

48 fast-path extensions | 0.5s clipboard poll | 100 buffered URLs

Two paths, one bug

For most of Veloxar's life, "add a URL to the queue" had two implementations. One lived inside ClipboardMonitor. It polled NSPasteboard, pulled out anything that looked like a link, built its own item, and shoved it into the link collector through a private method. The other lived in LinkCollectorService.addURLs(_:), which is what drag-and-drop, the manual paste box, and the browser extension all used. Same job, two code paths. You can probably guess how this ends.

They started out similar and drifted, the way these things do. The canonical path learned to strip zero-width characters; the clipboard path didn't. The canonical path pre-classified direct file URLs into the fast lane; the clipboard path sent everything through generic framework detection. The canonical path deduped on a normalized form. The clipboard path compared raw strings.

The user-visible result was what we started calling the placeholder leak. If the clipboard watcher was on and you pasted a link to a page that needed a crawler, the link showed up twice. Once as a placeholder row the clipboard path added for the original page URL, and once as a fully resolved row the canonical pipeline added a second later for the extracted media URL. The first row never cleaned itself up because no one owned it. It just sat there, a ghost. I stared at those logs for longer than I want to admit before I realized the problem wasn't the dedup check -- it was that the two paths had quietly stopped agreeing on what "the same URL" meant.

Collapsing the two paths

The fix wasn't a feature; it was a delete. We tore out the clipboard path's private add method and replaced it with one entry point on the link collector: processText(_:). It takes arbitrary text (whatever was on the clipboard, the contents of a dropped .txt file, a wall of links pasted into the UI), runs it through one extraction step, and hands the extracted URLs to addURLs.

before:  clipboard  -> extractURLs_v1 -> addItems (private)
         paste/drop -> extractURLs_v2 -> addURLs

after:   clipboard  -+
         paste/drop -+-> processText -> addURLs
         file drop  -+

The structural change matters more than the line count. Before, there were two places that knew how to turn text into URLs and two places that knew how to turn URLs into download rows. Now there's one of each. A new edge case in extraction lands in one file and fixes the bug in every entry point at the same time.
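To make the shape concrete, here is a minimal sketch of a single entry point feeding one deduplicating add step. It is not Veloxar's code: the LinkCollector class, the whitespace-token extraction, and the normalization rule (lowercased scheme and host) are all assumptions chosen to keep the example small. Only the names processText and addURLs come from the description above.

```swift
import Foundation

/// Hypothetical stand-in for the link collector. Only the call shape
/// (processText -> addURLs) follows the post; everything else is assumed.
final class LinkCollector {
    private(set) var rows: [URL] = []
    private var seenKeys = Set<String>()

    /// Single entry point for raw text from any source
    /// (clipboard, paste box, dropped file).
    func processText(_ text: String) {
        // Stand-in extraction: whitespace-separated tokens that parse as
        // http(s) URLs. The real extraction step is richer than this.
        let urls = text
            .split(whereSeparator: { $0.isWhitespace })
            .compactMap { URL(string: String($0)) }
            .filter { url in
                let scheme = url.scheme?.lowercased()
                return scheme == "http" || scheme == "https"
            }
        addURLs(urls)
    }

    /// The one place that turns URLs into rows, deduplicating on a
    /// normalized key instead of the raw string.
    func addURLs(_ urls: [URL]) {
        for url in urls {
            let key = normalizedKey(for: url)
            if seenKeys.insert(key).inserted {
                rows.append(url)
            }
        }
    }

    /// Assumed normalization: lowercase the scheme and host, keep the rest.
    private func normalizedKey(for url: URL) -> String {
        guard var parts = URLComponents(url: url, resolvingAgainstBaseURL: false) else {
            return url.absoluteString
        }
        parts.scheme = parts.scheme?.lowercased()
        parts.host = parts.host?.lowercased()
        return parts.string ?? url.absoluteString
    }
}
```

Because every caller goes through processText, a link that arrives once from the clipboard and once from a paste produces a single row, whichever source gets there first.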

Extraction itself runs in two stages. First we push the text through Foundation's NSDataDetector, which catches well-formed HTTP(S) URLs. Then a small set of regex patterns picks up hosters whose share links don't look like ordinary web URLs: Dropbox share tokens, Mediafire shortlinks, MEGA fragment links, a few others. We split it this way because NSDataDetector is fast and correct for the 95% case but has no idea how file hosters mint URLs.
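The two stages could look roughly like the sketch below. The NSDataDetector and NSRegularExpression calls are real Foundation APIs (NSDataDetector is available on Apple platforms), but the function names and the sample pattern are hypothetical. The pattern uses a made-up hoster.example domain and does not match any real hoster's URL format.

```swift
import Foundation

/// Stage 1: well-formed web links found by NSDataDetector,
/// filtered to http and https.
func detectWebURLs(in text: String) -> [URL] {
    let linkType = NSTextCheckingResult.CheckingType.link.rawValue
    guard let detector = try? NSDataDetector(types: linkType) else { return [] }
    let range = NSRange(text.startIndex..., in: text)
    return detector.matches(in: text, options: [], range: range)
        .compactMap { $0.url }
        .filter { ["http", "https"].contains($0.scheme?.lowercased() ?? "") }
}

/// Stage 2: hoster-specific share links that the detector may not
/// recognize. The patterns are supplied by the caller.
func matchHosterLinks(in text: String, patterns: [String]) -> [String] {
    let range = NSRange(text.startIndex..., in: text)
    var found: [String] = []
    for pattern in patterns {
        guard let regex = try? NSRegularExpression(pattern: pattern) else { continue }
        for match in regex.matches(in: text, options: [], range: range) {
            if let swiftRange = Range(match.range, in: text) {
                found.append(String(text[swiftRange]))
            }
        }
    }
    return found
}

/// Both stages combined, keeping first-seen order and dropping exact repeats.
func extractCandidates(in text: String, hosterPatterns: [String]) -> [String] {
    var seen = Set<String>()
    let all = detectWebURLs(in: text).map { $0.absoluteString }
        + matchHosterLinks(in: text, patterns: hosterPatterns)
    return all.filter { seen.insert($0).inserted }
}

// Hypothetical pattern for illustration only.
let samplePatterns = [#"\bhoster\.example/s/[A-Za-z0-9]+"#]
```

Running the detector first keeps the regex list short: the patterns only need to cover what the detector misses.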

The invisible character problem

Half of the original duplicate bugs came from content that looked identical to the eye but wasn't identical to a string comparison. Text copied out of a chat app, a web page, or a PDF tends to arrive with invisible hitchhikers: zero-width spaces (U+200B), byte-order marks, mismatched line endings, stray non-breaking spaces. You don't see them. Strict parsers do.

A URL with a leading U+200B is not a URL as far as URL(string:) is concerned. The h in https isn't where it should be, so extraction either fails outright or produces something subtly wrong that then fails to match the dedup cache. The first version of the clipboard path just handed whatever it got to URL(string:) and shrugged when the same link occasionally produced two rows.

The content-cleaning pass strips U+200B, BOM, and NBSP before anything else runs. It sounds like a nitpick; it was the single biggest source of duplicate-row reports in the old architecture. Two identical URLs pasted from two different sources were not identical bytes, so the dedup check missed and both made it through to the queue.

Line endings get the same treatment. A text file dropped from Windows arrives with \r\n. One from an older macOS source can still surprise you with a bare \r. Most modern stuff is \n. Split on one delimiter and ignore the others and you end up with "URLs" that have a trailing carriage return glued to them, which is how you wake up to another duplicate-row bug. Normalize once at the top of processText and the rest of the pipeline gets to assume clean input.
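A minimal sketch of that cleaning and normalization pass, assuming it works on Unicode scalars and that NBSP is removed rather than converted to a regular space. The names cleanContent and cleanedLines are hypothetical, not Veloxar's.

```swift
import Foundation

/// Scalars removed before any parsing:
/// zero-width space, byte-order mark, non-breaking space.
let invisibleScalars: Set<Unicode.Scalar> = ["\u{200B}", "\u{FEFF}", "\u{00A0}"]

/// Strips the invisible scalars, then folds \r\n and bare \r into \n.
func cleanContent(_ text: String) -> String {
    let visible = text.unicodeScalars.filter { !invisibleScalars.contains($0) }
    return String(String.UnicodeScalarView(visible))
        .replacingOccurrences(of: "\r\n", with: "\n")  // Windows
        .replacingOccurrences(of: "\r", with: "\n")    // bare carriage return
}

/// Splits cleaned text into non-empty lines.
func cleanedLines(_ text: String) -> [String] {
    cleanContent(text)
        .split(separator: "\n")
        .map(String.init)
}
```

Order matters here: \r\n has to be folded before the bare \r, otherwise each Windows line ending would turn into two newlines. The split drops the resulting empty lines anyway, but a caller using cleanContent directly would see them.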

Confidence, not certainty

The clipboard is a noisy input. People copy URLs, but they also copy sentences that happen to contain a URL, and sometimes they copy a chunk of HTML with five URLs in it where they only meant one. We didn't want the watcher jumping on every stray match, so every extracted URL gets a confidence score in [0, 1].

The scoring is deliberately dull. Start at 0.5. Bump it up for URLs that match a known hoster pattern, bump it up again for direct file extensions, drop it for suspiciously short URLs or ones that look like fragments. Anything below 0.3 is dropped. Anything above 0.8, in the watcher's aggressive mode, auto-adds.
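As a rough sketch of that scoring: the 0.5 starting point and the 0.3 and 0.8 thresholds come from the description above, but the bump and penalty sizes, the 15-character cutoff for "suspiciously short", and all type and field names are assumptions. What happens to URLs in the middle band is not specified here, so the sketch simply labels them keep.

```swift
/// Signals for one extracted URL. Field names are hypothetical.
struct Candidate {
    let text: String
    let matchesHosterPattern: Bool
    let hasDirectFileExtension: Bool
    let looksLikeFragment: Bool
}

enum Decision: Equatable {
    case drop, keep, autoAdd
}

/// Start at 0.5, add for positive signals, subtract for negative ones,
/// and clamp to [0, 1]. The step sizes are assumed values.
func confidence(for candidate: Candidate) -> Double {
    var score = 0.5
    if candidate.matchesHosterPattern { score += 0.2 }
    if candidate.hasDirectFileExtension { score += 0.2 }
    if candidate.text.count < 15 { score -= 0.15 }   // assumed "short" cutoff
    if candidate.looksLikeFragment { score -= 0.15 }
    return min(max(score, 0.0), 1.0)
}

/// Below dropBelow the URL is discarded; above autoAddAbove it is
/// auto-added, but only when the watcher is in aggressive mode.
func decide(_ candidate: Candidate,
            aggressive: Bool,
            dropBelow: Double = 0.3,
            autoAddAbove: Double = 0.8) -> Decision {
    let score = confidence(for: candidate)
    if score < dropBelow { return .drop }
    if aggressive && score > autoAddAbove { return .autoAdd }
    return .keep
}
```

Keeping the thresholds as parameters is what turns the score into a single adjustable knob instead of a set of hard-coded rules.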

The score isn't trying to be clever. It's a single knob for "how twitchy should the clipboard watcher be?" Users who want it to grab everything move the threshold down. Users who only want it reacting to obvious media links move it up. Without the score we'd be writing if-ladders forever, trying to guess intent from the shape of a string.

The 48-extension fast path

Once processText has emitted URLs and handed them to addURLs, the pipeline has to decide what kind of link each one is. Most URLs need a full framework-detection pass: fetch the page, check what CMS or player is running, pick a plugin, let the plugin extract media. That's expensive, and for a big chunk of real-world links it's also unnecessary.

A direct .mp4 doesn't need a plugin. Neither does a .zip, a .pdf, an .m3u8, or any of the 44 other direct-file extensions we've collected in URLClassification. Before the dispatcher calls out to the framework system, it checks the URL's path extension against that list. If it matches, the URL skips framework detection entirely and goes straight to the downloader.

The savings are per-URL and modest, but they compound. A user pastes a wall of 40 direct CDN links; the old path would have fetched 40 pages and run each one through every registered crawler's pattern match before realizing they were all static files. The fast path turns that into 40 trivial extension checks and gets straight to work.
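The extension check described above can be sketched in a few lines. The set below holds only the four extensions named in this post, not the full list, and the Route enum and route(for:) function are hypothetical names rather than the actual URLClassification API.

```swift
import Foundation

/// Partial list for illustration: only the extensions named in the post.
let directFileExtensions: Set<String> = ["mp4", "zip", "pdf", "m3u8"]

enum Route: Equatable {
    case directDownload       // skip framework detection
    case frameworkDetection   // fetch the page and pick a plugin
}

/// Routes a URL by its path extension, ignoring case.
/// Query strings are not part of the path, so they do not affect the result.
func route(for url: URL) -> Route {
    let ext = url.pathExtension.lowercased()
    return directFileExtensions.contains(ext) ? .directDownload : .frameworkDetection
}
```

A set lookup on the path extension involves no network access, which is why 40 direct links cost 40 cheap checks instead of 40 page fetches.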

After the refactor

If the refactor worked, you notice nothing, which is both the honest answer and the goal. Paste a link, one row appears. Paste the same link twice, still one row. Copy a link to your clipboard while Veloxar is open and the watcher picks it up, and you get the same row you'd have gotten from dropping it on the window. No ghost placeholders next to the real entry, no two dedup implementations silently disagreeing about whether two strings are the same URL.

The lesson isn't about clipboards or URL parsing. Two code paths doing the same job will drift, and the drift will eventually surface as a user-visible bug nobody can reproduce on the first try. The cheap fix is to make them match. The real fix is to delete one of them.