Writing fastsync: Why is copying files still so slow, and how I doubled rsync’s speed

I wrote a tool called fastsync when I finally got too frustrated with rsync’s poor local copying performance.

This post covers how the tool came together, the root cause I found, and why explicit cache management seems to be the secret sauce that standard tools miss.

Initial Phase

When I do server tasks, I always calculate a mental baseline. If I need to copy X bytes of data over a local bus, based on the disk speeds, I expect it to take N time units.

N hours later... wtf. Why is this only 50% done?

A quick look at iostat revealed the ugly truth: with rsync doing the copy, the disks were running at less than 50% utilization.
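For anyone who wants to reproduce that measurement without iostat, here is a rough sketch of the same idea: sample the per-device "time spent doing I/O" counter from /proc/diskstats twice and turn the difference into a busy percentage. The function names are made up for illustration, and the field position assumes the standard Linux /proc/diskstats layout.

```python
def parse_io_ticks(diskstats_text):
    """Return {device: milliseconds spent doing I/O} from /proc/diskstats text."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # not a full stats line
        # fields[2] is the device name, fields[12] is "time spent doing I/Os" in ms
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before, after, elapsed_s):
    """Percent of wall-clock time each device was busy between two samples."""
    elapsed_ms = elapsed_s * 1000.0
    return {
        dev: 100.0 * (after[dev] - before[dev]) / elapsed_ms
        for dev in after
        if dev in before
    }
```

To use it, read /proc/diskstats, sleep for a second, read it again, and pass both parsed samples plus the elapsed time to utilization(). A disk that stays well below 100% while it is supposedly the bottleneck is the symptom described above.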

Astonishment

I couldn’t believe it. I am sure someone has thought about copying files efficiently before. I can't be the first sysadmin to want to push two disks to their limits. So why is it so slow?

The observed problem in a traditional copy loop looks like this:

1. Pick a file

2. Read a chunk

3. Transfer / process

4. Write a chunk

5. Wait for sync / completion

This causes clear read/write turn-taking: the source and destination disks alternate instead of working at the same time, which prevents reads from overlapping with writes and leaves resources underutilized. It’s like those old single-threaded programs: read something, fetch from the network, process, write... and you look at top seeing "100% utilization" of a single thread, while the disk is at 0.1%, the network at 0.5%, and the CPU is just sitting in a wait state. We are creating an artificial bottleneck.
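The loop above can be sketched in a few lines. This is not rsync's code, just a minimal illustration of the turn-taking pattern; the function name and the 1 MiB default block size are assumptions.

```python
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB, an assumed value for illustration


def serial_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy a file one block at a time; reads and writes never overlap."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)  # destination is idle while we read
            if not chunk:
                break
            dst.write(chunk)              # source is idle while we write
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())            # wait for completion
    return copied
```

Every step waits for the previous one, so at any moment at most one of the two disks is being asked to do something by this program.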

The Fix: First Attempt

I decided to write an alternate program in Python that decoupled these steps efficiently. I built a pipeline where:

- A dedicated thread reads

- A queue transfers data

- A dedicated thread writes

All happening independently, with a configurable block size and a bounded transfer queue depth. I also used a temporary file for writing, moving it into place via an atomic os.replace() at the end, so broken files never end up in the destination.
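A minimal sketch of such a pipeline, under stated assumptions: the function name, the ".fastsync-tmp" suffix, and the default sizes are invented for illustration and are not taken from the real fastsync code.

```python
import os
import queue
import threading


def pipelined_copy(src_path, dst_path, block_size=1024 * 1024, queue_depth=4):
    """Copy src to dst with a reader thread, a bounded queue, and a writer thread."""
    blocks = queue.Queue(maxsize=queue_depth)  # bounded: reader blocks when full
    errors = []
    tmp_path = dst_path + ".fastsync-tmp"      # hypothetical temp file name

    def reader():
        try:
            with open(src_path, "rb") as src:
                while True:
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    blocks.put(chunk)
        except Exception as exc:
            errors.append(exc)
        finally:
            blocks.put(None)  # sentinel: tells the writer there is no more data

    def writer():
        saw_sentinel = False
        try:
            with open(tmp_path, "wb") as dst:
                while True:
                    chunk = blocks.get()
                    if chunk is None:
                        saw_sentinel = True
                        break
                    dst.write(chunk)
                dst.flush()
                os.fsync(dst.fileno())
        except Exception as exc:
            errors.append(exc)
            # Keep draining so the reader never blocks forever on a full queue.
            while not saw_sentinel:
                saw_sentinel = blocks.get() is None

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if errors:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise errors[0]
    os.replace(tmp_path, dst_path)  # atomic on the same filesystem
```

The bounded queue is what limits memory use: with the defaults here, at most queue_depth blocks of block_size bytes are in flight between the two threads.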

Second Astonishment

I did the full decoupling. Why was it still slow? This was confusing.

I experimented and increased the transfer queue depth massively - basically buffering the whole file in RAM (up to 1 GiB).

Boom. Now it worked as expected. 100% utilization. The disks were reading and writing concurrently.

But why? Increasing the transfer buffer to absurd sizes shouldn't be the right way to fix a pipeline. This technically worked, and the initial problem was solved, but it kept nagging me. Why did it require such a huge queue depth to prevent the pipeline from stalling?

The Next Day: The Secret Sauce

The next day, based on my experience with generic system optimizations, I decided to tackle cache management.

I strongly suspected the stalling was related to the combination of write-back caching, delayed allocation, and fsync. Writes pile up as dirty pages in RAM, and then the filesystem or the flush daemon forces a long wait until that data is physically committed to the disk, blocking other I/O.

So, I added explicit posix_fadvise() hints to my Python script. This is the secret sauce that, as far as I can tell, rsync is NOT applying by default for large sequential transfers:

_fadvise(src_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')

_fadvise(src_fd, 0, 0, 'POSIX_FADV_NOREUSE')

_fadvise(dst_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')

_fadvise(dst_fd, 0, 0, 'POSIX_FADV_NOREUSE')

Note: I also implemented a rolling POSIX_FADV_DONTNEED window trailing behind the read/write heads to drop page cache progressively, rather than waiting for EOF.
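A trailing window like that could be sketched as follows. The class name and the 8 MiB default window are assumptions for illustration, not the actual fastsync implementation.

```python
import os


class DropBehind:
    """Trailing POSIX_FADV_DONTNEED window for one file descriptor."""

    def __init__(self, fd, window=8 * 1024 * 1024):
        self.fd = fd
        self.window = window      # bytes to keep cached behind the head
        self.dropped_upto = 0     # everything before this offset was advised

    def advance(self, position):
        """Advise the kernel to drop cache older than `window` bytes.

        `position` is the current read or write offset. Returns the number
        of bytes newly covered by the hint.
        """
        target = position - self.window
        if target <= self.dropped_upto:
            return 0
        length = target - self.dropped_upto
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(self.fd, self.dropped_upto, length,
                             os.POSIX_FADV_DONTNEED)
        self.dropped_upto = target
        return length
```

The reader and writer would each call advance() with their current offset after every block. One caveat from the posix_fadvise(2) man page: pages that have not yet been written out are unaffected by DONTNEED, so on the write side a tool may need to flush first (for example with os.fdatasync()) before the hint can actually free them. Whether fastsync does that is not covered here.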

I tested it with the large queue. It worked.

Then, I dropped the queue depth back down to something sane - like 4 blocks of 1 MiB.

It still worked. Perfectly.

This confirmed my theory. The root cause of the slowness was a lack of explicit cache management. With posix_fadvise hints in place and the cache managed properly, the kernel stops pausing the pipeline to flush massive dirty page buffers, so the long flush stalls disappear. This cut the transfer time in half and shrank my required memory buffer from 1 GiB down to a few MiB.

It sometimes feels like this goes into the category of "why my Python scripts are often faster than Java, .NET, or C++ programs". It’s not because the language is faster, it's because the program just does things smarter. I can't believe there are still improvements this big just sitting there, seemingly unnoticed by mainstream tools.

Factors and Addressing the Skeptics

When I posted about this on the Fediverse (Mastodon/Pleroma), a few people were understandably skeptical. “Rsync has been maintained by experienced system programmers for 30 years... are you sure you just beat them with fadvise hints?”

They asked excellent questions about bottlenecks, workloads, and environments. So, to be clear, "fastsync is 2x faster" applies because my environment perfectly aligned with these factors:

- Local I/O Bottleneck: I am not bottlenecked by a WAN or Gigabit LAN. The source and destination systems have roughly the same storage speed. If you are network-bound, local cache hints mean nothing.

- SMR Backup Drives: My destination was an SMR (Shingled Magnetic Recording) drive for backups. SMR drives punish you severely if you write chaotically or trigger deep flush stalls. And yet, the exact same mechanics apply to SSDs and CMR/PMR drives as well. SMR drives and large datasets simply made the pain acute enough that I finally sat down to find the root cause.

- Filesystem: Both the source and the destination use btrfs.

- Preallocation: I use os.posix_fallocate() before writing, and also run rsync with --preallocate, so this is a level playing field between the two tools.

- Whole-file mode: Transfers are always --whole-file replacements, entirely bypassing block-level delta checksumming overhead.

- High RAM: The system has plenty of RAM, meaning without fadvise, the kernel would happily cache massive amounts of dirty pages before hitting a wall and stalling the pipeline.
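For the preallocation item in the list above, a minimal sketch of what reserving space before writing can look like. The helper name is hypothetical; os.posix_fallocate() is the real call, available on Unix.

```python
import os


def preallocate(fd, size):
    """Reserve `size` bytes for the file behind fd; return True on success."""
    if size <= 0 or not hasattr(os, "posix_fallocate"):
        return False  # nothing to reserve, or unsupported platform
    try:
        os.posix_fallocate(fd, 0, size)  # offset 0, length = final file size
    except OSError:
        return False  # e.g. out of space; caller decides how to proceed
    return True
```

Calling this on the temporary destination file with the source file's size, before the first write, gives the filesystem the chance to pick a contiguous region up front.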

By giving the kernel NOREUSE and DONTNEED hints, we prevent that dirty-page wall. We feed the disk at a steady, streaming pace, much like how we used to feed CD burners back in the day to avoid buffer underruns.

Open Questions

I know the cache management solved the performance problem. I didn't bother to dig into the deepest kernel source code on this, mostly because I'm sure there are smarter people reading this who know the Linux VFS and btrfs subsystems much better than I do.

So my questions to you:

1. Why exactly does the combination of SEQUENTIAL, NOREUSE, and progressive DONTNEED smooth out the writes so perfectly compared to the default kernel heuristics?

2. Yes, I am aware there are probably sysctl tweaks (dirty_ratio, dirty_background_ratio) that could have globally improved the situation. But why should a sysadmin have to tune global kernel memory parameters just to copy a large file efficiently without trashing the system cache?
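For readers who want to check what those global thresholds are on their own machine before answering, a small sketch that reads them from /proc/sys/vm. The function names are invented for illustration; the two file names are the standard Linux sysctl entries.

```python
from pathlib import Path


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return one vm sysctl value as an int, or None if it cannot be read."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (OSError, ValueError):
        return None


def dirty_thresholds(root="/proc/sys/vm"):
    """Return the background and blocking dirty-page ratios, in percent."""
    names = ("dirty_background_ratio", "dirty_ratio")
    return {name: read_vm_setting(name, root) for name in names}
```

A value of 0 does not necessarily mean "no limit": when the byte-based counterparts (dirty_background_bytes, dirty_bytes) are set, the ratio files read as 0.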

If you have insights into the exact mechanism of why this works so dramatically well, let me know.

2026-06-09

Answers (?) by AI

Gemini 3.1 Pro said...

Here is the technical breakdown of exactly why your manual cache management outperforms the default Linux Virtual File System (VFS) heuristics in this scenario.

Core Mechanism: Defeating the "Dirty Page Wall"

The default kernel behavior is optimized for general-purpose workloads (bursty, repeatedly accessed data), not large, single-pass sequential streams.

- Default Heuristic (Stop-and-Go I/O): When copying a file natively, the kernel utilizes write-back caching. It buffers writes in RAM as dirty pages. The VFS waits to flush these to disk until the system hits the global vm.dirty_background_ratio. If the I/O bus is slower than the CPU/RAM (which it always is), the dirty pages keep accumulating until they hit vm.dirty_ratio. At this exact moment, the kernel blocks the writing process entirely to force synchronous flushing. This creates a severe start-stop pipeline: burst writes to RAM, stall completely while disk thrashes, repeat.

- Your Heuristic (Streaming I/O): By using a trailing window of cache invalidation, you bypass the global vm.*_ratio thresholds entirely. You are forcing the kernel into a steady "conveyor belt" model, pushing I/O to the block layer at the exact speed the hardware can accept it.

The Role of Each Hint

- POSIX_FADV_SEQUENTIAL:

  - Reads: It doubles the default read-ahead window size. This ensures the source drive is constantly pre-fetching blocks into RAM before your script asks for them, preventing read-side starvation.

  - Writes: It gives the block I/O scheduler a clear hint to merge I/O requests sequentially, minimizing mechanical seeks (crucial for HDDs).

- POSIX_FADV_NOREUSE:

  - Signals that the data will be accessed exactly once. Note: In the Linux kernel, NOREUSE historically falls back to doing very little natively (often mapping to WILLNEED), but conceptually it tells the VFS not to prioritize keeping these pages in the active LRU (Least Recently Used) list.

- POSIX_FADV_DONTNEED (Progressive):

  - The heavy hitter. By progressively calling this on the trailing window behind your read/write head, you explicitly evict pages from the page cache.

  - For writes, triggering DONTNEED on dirty pages forces the kernel to initiate writeback for that specific range immediately and then drop the pages. This entirely prevents the accumulation of the massive dirty page buffer. Your script effectively manages the buffer bloat instead of letting the kernel guess when to flush.

Why the Effect is Amplified on Btrfs and SMR

While these mechanics apply to generic Ext4/SSD setups, your specific environment turns a sub-optimal process into a catastrophic one without these hints.

- Btrfs (Copy-on-Write): Btrfs has heavy metadata overhead. When the kernel flushes a massive 1 GiB chunk of dirty pages all at once (default heuristic), Btrfs must calculate checksums, allocate new extents (CoW), and update the B-tree metadata for that massive chunk synchronously. This causes extreme CPU and I/O latency spikes. A steady stream spreads this overhead out evenly.

- SMR (Shingled Magnetic Recording): SMR drives have a small PMR/CMR landing zone cache. If you dump a massive burst of dirty pages to the drive, that cache instantly overflows. The drive must then pause to perform brutal read-modify-write cycles directly on the overlapping shingled tracks. By forcing a smooth, controlled, sequential stream using fadvise, you allow the drive's internal firmware to stream writes directly to the shingled zones efficiently without overflowing the CMR cache.

By actively micro-managing the page cache lifecycle, your fastsync script prevents the kernel from outsmarting itself and turning a simple sequential stream into a complex, blocking memory-management task.

2026-06-09