Writing fastsync: Why is copying files still so slow, and how I doubled rsync’s speed
I wrote a tool called fastsync when I finally got too frustrated with rsync’s poor local copying performance.
This post goes through the making of the tool, the root cause I found, and why proper cache management is apparently the secret sauce that standard tools seem to miss.
Initial Phase
When I do server tasks, I always calculate a mental baseline. If I need to copy X bytes of data over a local bus, based on the disk speeds, I expect it to take N time units.
N hours later... wtf. Why is this only 50% done?
A quick look at iostat revealed the ugly truth: the disks were running at less than 50% utilization. Thanks, rsync.
Astonishment
I couldn’t believe it. I am sure someone has thought about copying files efficiently before. I can't be the first sysadmin to want to push two disks to their limits. So why is it so slow?
The observed problem in a traditional copy loop looks like this:
1. Pick a file
2. Read a chunk
3. Transfer / process
4. Write a chunk
5. Wait for sync / completion
This causes a clear read/write turn-taking behavior. It blocks efficient parallel copying, and leaves resources underutilized. It’s like those old single-threaded programs: read something, fetch from the network, process, write... and you look at top seeing "100% utilization" of a single thread, while the disk is at 0.1%, the network at 0.5%, and the CPU is just sitting in a wait state. We are creating an artificial bottleneck.
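To make the turn-taking concrete, here is a minimal sketch of such a serialized loop. It is not rsync's actual code, just an illustration of the pattern; the function name naive_copy and the 1 MiB default block size are assumptions for the example.

```python
import os


def naive_copy(src_path, dst_path, block_size=1024 * 1024):
    """Copy a file one block at a time; reads and writes never overlap."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)  # this thread issues no write while it reads
            if not chunk:
                break
            dst.write(chunk)              # this thread issues no read while it writes
        dst.flush()
        os.fsync(dst.fileno())            # step 5: block until the data is on disk
```

Every step waits for the previous one, so at any moment at most one side of the copy is being driven by the program.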
The Fix: First Attempt
I decided to write an alternate program in Python that decoupled these steps efficiently. I built a pipeline where:
- A dedicated thread reads
- A queue transfers data
- A dedicated thread writes
All happening independently, with a configurable block size and a bounded transfer queue depth. I also used a temporary file for writing, moving it into place via an atomic os.replace() at the end, so broken files never end up in the destination.
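A minimal sketch of that kind of pipeline, not the actual fastsync source: the names pipelined_copy, block_size and queue_depth, the ".fastsync-" temp prefix, and the error handling are all assumptions made for illustration.

```python
import os
import queue
import tempfile
import threading

_EOF = object()  # sentinel the reader puts on the queue when it is done


def pipelined_copy(src_path, dst_path, block_size=1024 * 1024, queue_depth=4):
    """Copy src to dst with a reader thread feeding a bounded queue."""
    blocks = queue.Queue(maxsize=queue_depth)  # bounded: reader blocks when full
    stop = threading.Event()
    reader_errors = []

    def reader():
        try:
            with open(src_path, "rb") as src:
                while not stop.is_set():
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    blocks.put(chunk)
        except Exception as exc:
            reader_errors.append(exc)
        finally:
            blocks.put(_EOF)

    # Write to a temp file in the destination directory so os.replace is atomic.
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, prefix=".fastsync-")
    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    saw_eof = False
    try:
        with os.fdopen(fd, "wb") as dst:
            while True:
                chunk = blocks.get()
                if chunk is _EOF:
                    saw_eof = True
                    break
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
        if reader_errors:
            raise reader_errors[0]
        os.replace(tmp_path, dst_path)  # the destination only ever sees a complete file
    except BaseException:
        stop.set()
        # Drain the queue so a reader blocked on put() can finish.
        while not saw_eof and blocks.get() is not _EOF:
            pass
        os.unlink(tmp_path)
        raise
    finally:
        thread.join()
```

Here the writer runs in the calling thread rather than in a second dedicated thread, which keeps the sketch short; the reader and writer still overlap. Note that mkstemp creates the file with restrictive permissions, so a real tool would also copy mode and timestamps.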
Second Astonishment
I did the full decoupling. Why was it still slow? This was confusing.
I experimented and increased the transfer queue depth massively - basically buffering the whole file in RAM (up to 1 GiB).
Boom. Now it worked as expected. 100% utilization. The disks were reading and writing concurrently.
But why? Increasing the transfer buffer to absurd sizes shouldn't be the right way to fix a pipeline. This technically worked, and the initial problem was solved, but it kept nagging me. Why did it require such a huge queue depth to prevent the pipeline from stalling?
The Next Day: The Secret Sauce
The next day, based on my experience with generic system optimizations, I decided to tackle cache management.
I strongly suspected the stalling was related to the kernel’s delayed allocation and fsync combo. Delayed allocation gathers dirty pages in RAM, and then the filesystem or the flush daemon forces a massive wait until that data is physically committed to the disk, blocking other I/O.
So, I added explicit posix_fadvise() hints to my Python script. This is the secret sauce that, as far as I can tell, rsync does not apply by default for large sequential transfers:
_fadvise(src_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
_fadvise(src_fd, 0, 0, 'POSIX_FADV_NOREUSE')
_fadvise(dst_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
_fadvise(dst_fd, 0, 0, 'POSIX_FADV_NOREUSE')
Note: I also implemented a rolling POSIX_FADV_DONTNEED window trailing behind the read/write heads to drop page cache progressively, rather than waiting for EOF.
I tested it with the large queue. It worked.
Then, I dropped the queue depth back down to something sane - like 4 blocks of 1 MiB.
It still worked. Perfectly.
This confirmed my theory. The root cause of the slowness was a lack of explicit cache management. With the posix_fadvise hints and the rolling cache drop in place, the kernel stops pausing the pipeline to flush massive dirty page buffers, so reads and writes keep streaming instead of stalling. This cut the transfer time in half and shrank my required memory buffer from 1 GiB down to a few MiB.
This feels like it belongs in the category of "why my Python scripts are often faster than Java, .NET, or C++ programs". It’s not because the language is faster; it’s because the program just does things smarter. I can’t believe there are still improvements this big just sitting there, seemingly unnoticed by mainstream tools.
Factors and Addressing the Skeptics
When I posted about this on the Fediverse (Mastodon/Pleroma), a few people were understandably skeptical. “Rsync has been maintained by experienced system programmers for 30 years... are you sure you just beat them with fadvise hints?”
They asked excellent questions about bottlenecks, workloads, and environments. So, to be clear, "fastsync is 2x faster" holds for my setup, where all of these factors lined up:
- Local I/O Bottleneck: I am not bottlenecked by a WAN or Gigabit LAN. The source and destination systems have roughly the same storage speed. If you are network-bound, local cache hints mean nothing.
- SMR Backup Drives: My destination was an SMR (Shingled Magnetic Recording) drive for backups. SMR drives punish you severely if you write chaotically or trigger deep flush stalls. The same mechanics apply to SSDs and CMR/PMR drives as well; SMR drives and large datasets simply made the pain acute enough that I finally sat down to find the root cause.
- Filesystem: Both filesystems use btrfs.
- Preallocation: I use os.posix_fallocate() before writing, and also run rsync with --preallocate, so this is a level playing field between the two tools.
- Whole-file mode: Transfers are always --whole-file replacements, entirely bypassing block-level delta checksumming overhead.
- High RAM: The system has plenty of RAM, meaning without fadvise, the kernel would happily cache massive amounts of dirty pages before hitting a wall and stalling the pipeline.
By giving the kernel NOREUSE and DONTNEED hints, we prevent that dirty-page wall. We feed the disk at a steady, streaming pace, much like how we used to feed CD burners back in the day to avoid buffer underruns.
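The preallocation step from the list above can be sketched as follows. The helper name preallocate and its best-effort fallback are assumptions for illustration; only os.posix_fallocate itself comes from the setup described here.

```python
import os


def preallocate(fd, size):
    """Reserve `size` bytes for fd up front; returns False if not possible."""
    if size <= 0 or not hasattr(os, "posix_fallocate"):
        return False  # a zero length is rejected by posix_fallocate
    try:
        os.posix_fallocate(fd, 0, size)  # offset 0, length = expected file size
    except OSError:
        return False  # unsupported filesystem, no space, and so on
    return True
```

It would typically be called once on the temporary destination file, with the source file's size, before the first block is written.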
Open Questions
I know the cache management solved the performance problem. I didn't bother to dig into the deepest kernel source code on this, mostly because I'm sure there are smarter people reading this who know the Linux VFS and btrfs subsystems much better than I do.
So my questions to you:
1. Why exactly does the combination of SEQUENTIAL, NOREUSE, and progressive DONTNEED smooth out the writes so perfectly compared to the default kernel heuristics?
2. Yes, I am aware there are probably sysctl tweaks (vm.dirty_ratio, vm.dirty_background_ratio) that could have globally improved the situation. But why should a sysadmin have to tune global kernel memory parameters just to copy a large file efficiently without trashing the system cache?
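For anyone who wants to dig into this, a small sketch for watching the dirty-page backlog while a copy runs might help. It assumes a Linux system with /proc mounted; the helper names parse_meminfo_kib and read_vm_setting are made up for this example.

```python
def parse_meminfo_kib(text, keys=("Dirty", "Writeback")):
    """Extract selected fields (in KiB) from the text of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            values[name] = int(rest.split()[0])
    return values


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return an integer VM sysctl such as dirty_ratio, or None if unreadable."""
    try:
        with open(f"{root}/{name}") as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for setting in ("dirty_ratio", "dirty_background_ratio"):
        print(setting, read_vm_setting(setting))
    try:
        with open("/proc/meminfo") as fh:
            print(parse_meminfo_kib(fh.read()))
    except OSError:
        print("no /proc/meminfo on this system")
```

Sampling the Dirty and Writeback values once per second during a copy, with and without the fadvise hints, should show whether the backlog really grows to a wall in one case and stays flat in the other.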
If you have insights into the exact mechanism of why this works so dramatically well, let me know.
2026-06-09