Today in Tedium: Last week, in my attempts to optimize my laptop for my needs, I did something I had never done before: I copied my entire Linux home folder from one drive to another. I started on one distro (Nobara), but then began using another (Bazzite) on my laptop’s secondary SSD. I liked Nobara, but after I tried Bazzite, it sucked me in like a round of Balatro. I realized that Bazzite was now my main tool, but my other drive, which was larger, was still set up with Nobara. How’d I fix it? With a little rsync, of course. After setting up the right command-line flags, the process took around three hours and moved hundreds of gigabytes of data across two NVMe SSDs. That’s despite each drive being capable of around 6,000 megabytes per second under the best of conditions. It was disruptive to my workflow, but it got basically everything. The result, with its logs fluttering past my face at a speed of around a thousand files per second, mesmerized me, and it got me thinking about how copying files has evolved over the years. Today’s Tedium ponders file copying across drives and time. — Ernie @ Tedium
“The rsync algorithm efficiently computes which parts of a source file match some part of an existing destination file. These parts need not be sent across the link; all that is needed is a reference to the part of the destination file. Only parts of the source file which are not matched in this way need to be sent verbatim. The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the verbatim material.”
— Andrew Tridgell and Paul Mackerras of the Australian National University, writing in a 1996 academic paper about the algorithm rsync uses. It eventually evolved into an extremely popular way to sync data across folders. In layperson’s terms, the tool compares the shape of the files on two different machines and only transfers the data that has changed, rather than downloading and comparing the whole file, something other file-copying tools of its era did. The piece of software became one of the most popular in the open-source ecosystem and a bedrock part of file copying, though plenty have raised concerns about its age, end-user complexity, and technical flaws.
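In practice, that delta-transfer behavior looks something like this; the host name and paths here are just placeholders, not anything from the paper:

```bash
# First run: everything under ~/photos gets sent to the backup machine.
rsync -av ~/photos/ backup-host:/srv/backups/photos/

# Subsequent runs: unchanged files are skipped outright, and for files that
# did change, the delta algorithm sends only the pieces the destination
# doesn't already have.
rsync -av ~/photos/ backup-host:/srv/backups/photos/
```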
The roots of modern file-copying go all the way back to the PDP-6
As you might imagine, the ability to copy files and folders within a directory structure existed from almost the beginning of the history of computing. One of the first examples was the Peripheral Interchange Program, a piece of software dating to 1963. First developed for Digital Equipment Corporation’s PDP-6, an early part of its long-running Programmed Data Processor line of computers, the program has an interesting pedigree.
Its developer, Harrison “Dit” Morse, previously gained notoriety for taking part in a documentary called The Thinking Machine. The novel pitch of the hour-long program, dating to 1961, was that a computer could develop a television Western—roughly 60 years before ChatGPT made it possible to do so easily. (Friend of the newsletter Harry McCracken wrote about this ahead-of-its-time clip for Fast Company back in 2019.)
Dit Morse wrote the software that produced the western, called “Saga.”
Impressive in its own right, sure, but what Dit worked on at DEC ultimately did more to shape how we use computers. (Apologies to the LLM nuts out there.)
A 1984 retelling of PIP’s creation—shared in the newsgroup alt.sys.pdp10 in 1995 by early Computer History Museum member Jack H. Stevens and later collected in its archives—puts it like this:
PIP was invented by “Dit” Morse as a demonstration of device independence. Its original name was ATLATL, which stood for “Anything, Lord, to Anything, Lord”. This was appropriate, as it took a certain amount of prayer to get anything to move between media. In those days, when TTY’s had back-arrows (instead of underscores) that key was used instead of the equals sign in PIP. This, it was felt, was sufficiently obvious that anyone who, for example, tried to read from the line printer got a message like: “You gnerd, device LPT: can’t do input!” That message was changed the day after Ken Olsen tried out ATLATL.
The name ATLATL, which literally implied that it took a leap of faith to trust that data could be transferred between components, reflects a lack of trust in the process—especially given that said copying was happening to magnetic tape, drum memory, and hard drive platters bigger than your head.
Solving this problem was essential for mainstreaming computers in the business world. (That said, if you still get nervous about running a command like this in a terminal, I get it.)
Over time, DEC’s early software designs inspired their microcomputer successors. The highly influential CP/M, developed by computing icon and Computer Chronicles co-host Gary Kildall, didn’t have a copy command. Instead, it used PIP—reflecting Kildall’s background with DEC.
A command simply named “copy” probably would have been more obvious to an average person, but as Michael Swaine wrote in Dr. Dobb’s Journal in 1997, the choice reflected Kildall’s approach to development and entrepreneurship:
[Digital Research employee] Alan Cooper blames Gary. When anyone would tell Gary that he ought to add a particular feature, “Gary would try to argue you out of it.” He didn’t want to pollute good code with kludged-on features. The PIP command exemplified his attitude. In CP/M, you “Pipped” to drive B from drive A; in MS-DOS, you “Copied” from A to B. Gary thought that there was nothing wrong with using the command PIP to copy, and that any halfway intelligent person could master the concept that you copied (or pipped) from right to left. Bill Gates let people do it the way they wanted. “That difference in attitude,” Cooper says, “is worth twenty million dollars.” Gary didn’t care. What Gary was interested in was inventing.
So while CP/M directly inspired (some might say “was ripped off by”) Microsoft’s DOS, one thing that didn’t get copied over was PIP; the equivalent DOS command was instead named COPY.
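To make the difference concrete, here’s the same hypothetical copy of a file from drive A to drive B in both systems (the file name is made up):

```
PIP B:=A:REPORT.TXT
COPY A:REPORT.TXT B:
```

Same operation, opposite reading order.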
On the Unix side of things, which bled through to MacOS, FreeBSD, and Linux, the cp command came around in 1971, alongside the mv command, which moved files from one location to another, and ln, which created links between files in the UNIX file system. Despite the relative simplicity of the tools, the concerns raised by DEC’s early “ATLATL” nomenclature proved true. “All too often mistyped lists clobbered precious files,” a 1987 backgrounder on Unix explained. It turns out, it took a few tries to get things right:
What seems natural in hindsight was not clear-cut at the time: the final conventions arose only after long discussions about how properly to handle file permissions and multiple files.
The general confusion around copying created a natural on-ramp for the graphical user interface. Think of it this way: Sorting a stack of files is a clear thing to most people; a command-line interface is not. But the command line still has its benefits: Give a GUI too many files, and its file-copying simplicity falls apart. If you’re moving 100,000 files on the regular, the GUI often gets in the way.
Which brings us back to command-line tools like “cp” or “rsync.” These tools are all quite old at this point, and despite periodic improvements, we have found better tactics to do the things they do. But in some ways, a “cp” or “mv” command is like a “cd” command. For moving around files locally, it doesn’t need to be reinvented. Don’t overthink your bash script.
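For that kind of local job, a couple of lines are still all it takes; the paths below are placeholders:

```bash
# Copy a directory tree in place, preserving permissions and timestamps.
cp -a ~/projects /mnt/backup/

# Or just relocate it outright.
mv ~/projects /mnt/backup/
```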
Five iconic programs specifically used for copying and backing up large numbers of files
- XCOPY. Dating to 1986, this copy command, standing for “extended copy,” is to MS-DOS and Windows what rsync is to Unix-based systems—more robust than the option most people use, but these days a bit old hat, in part because of limitations its 39-year-old implementation didn’t account for. It’s long been superseded by robocopy, a more capable implementation of the same idea.
- Carbon Copy Cloner. This MacOS tool is an excellent graphical choice if your goal is to create a wholesale bootable clone of a drive. It was my tool of choice for Hackintoshing for many years. The only problem? Apple periodically changes things, forcing the developer to fix the tool every couple of years.
- Dropbox. An infamous comment on Hacker News emerged upon the launch of Dropbox, perhaps the first “modern” cloud app targeted at file syncing, an approach that has since become the backup technique of choice. Key line: “For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.” No wonder people liked it. Dropbox wasn’t originally built for backups—it was built to share data between machines—but it turned into a backup system nonetheless.
- Time Machine. Apple’s tool for backing up your files with versioning aims for showiness that rarely appears in backup tools. The tool’s approach to highlighting snapshots feels almost gamified. But despite this, Time Machine—first launching in 2007 and replacing a less-fancy backup app—has become a bedrock part of the MacOS experience. Other backup solutions have more knobs and deeper feature sets, but Time Machine simply looks cooler.
- rclone. If rsync is like the Unix cp tool with batteries, rclone is rsync with batteries, built specifically for cloud-based tasks, such as backups to Amazon S3. One key benefit it has over rsync is support for parallel transfers, something rsync lacks. I mentioned it in my piece about trying to find a Dropbox replacement.
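As a rough sketch of what that cloud-oriented workflow looks like, assuming a remote named “s3remote” has already been configured with the rclone config command (the bucket name and paths are placeholders):

```bash
# Mirror a local folder to an S3 bucket, running several transfers at once.
rclone sync ~/documents s3remote:my-backup-bucket/documents \
  --transfers 8 --progress
```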
RAID is one of the most important computing concepts you barely understand
Of course, programs that copy data are just one piece of the pie. There was also the idea of simply baking redundancy and constant iterative backups into the process of operating a computer, something that was especially important for high-end computing use cases like server rooms. Problem was, as the need for parity and data security increased, the storage hardware itself was headed in a cheaper, less reliable direction.
That’s the dilemma that RAID, originally an acronym for “Redundant Arrays of Inexpensive Disks,” was invented to solve. In a nutshell, RAID relies on the fact that cheap storage is plentiful and easily accessible, then simply puts more of it in place, ensuring redundancy for critical data. Given that cheaper storage meant less reliable storage at the time, this gave companies a way to use less-expensive storage while mitigating its downsides.
The idea was introduced to the public in a 1988 academic paper, “A Case for Redundant Arrays of Inexpensive Disks.” The lead author on the paper, David A. Patterson of the University of California, Berkeley, already had quite the pedigree. He was one of the academics who identified the concept of RISC processing around 1980. RAID is nearly as impressive. At the time, the market for data storage was evolving, but not as quickly as other parts of computing, such as processing capabilities. Data integrity mattered even as data storage needs continued to grow, but the path to more data was going to be on a less-expensive, less-reliable road.
The paper poses this as a problem needing a solution. And that solution was RAID.
“To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails,” wrote Patterson and his co-authors, Garth Gibson and Randy H. Katz.
The rub of all this is that building in redundancy is complicated, because there are seemingly endless ways to split up storage on cheap hardware. The paper lays out five of them, with varying levels of speed, error correction, and efficiency. There are some limitations (for one thing, if you suffer some sort of emergency or natural disaster, having a machine that protects the data from corruption only does so much).
“RAIDs offer a cost-effective option to meet the challenge of exponential growth in the processor and memory speeds,” the paper adds. “We believe the size reduction of personal computer disks is a key to the success of disk arrays, just as Gordon Bell argues that the size reduction of microprocessors is a key to the success in multiprocessors.”
It turns out, there are many ways to slice up multiple disks, and none of them are created equal.
Example: A couple of years ago, Apple upset some of its fans by kneecapping its lowest-end machines—the M2 MacBook Air, M2 MacBook Pro, and M2 Mac Mini—with a single NAND chip, rather than the two that the higher-end models use. Part of the reason that modern SSDs are fast comes down to their RAID 0-style setups, which divide data across multiple chips without any additional redundancy. Essentially, RAID 0 creates two lanes of traffic for storing data, rather than just one, which eases congestion and gets your data to its destination more quickly. However, if one of the SSD chips wears out, the entire drive is toast. And if you build the SSD with just one chip, there goes half the performance—a problem that arguably made the base M2 a worse deal than the base M1 it replaced.
But RAID isn’t just a way to speed up data—it’s also a way to protect data from corruption, an important use case in cloud and server environments. RAID 3 and RAID 4, for example, utilize a dedicated parity disk to build redundancy. Meanwhile, RAID 5 distributes parity across the drives, an approach that allows you to pull out one of the drives without losing access to the data. Put simply, the extra work means you’re not getting the blazing speeds of RAID 0 on these, but it reflects that the goal may not be speed.
You may just want your data to stick around for the next decade.
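On Linux, software RAID makes those trade-offs concrete. Here’s a rough sketch using mdadm, Linux’s software RAID tool; the device names are placeholders and the layouts are purely illustrative:

```bash
# RAID 0: stripe two drives for speed, with no redundancy at all.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc

# RAID 5: stripe three drives with distributed parity, so any single
# drive can fail without taking the data with it.
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sdd /dev/sde /dev/sdf
```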
Rsync is an immensely fascinating program, because it is inherently a simple one. Having been in regular use for nearly 30 years, it is a technology cockroach, one that exists in myriad settings due to its high flexibility. It’s not a tool you use to copy over a single file—it’s a tool you use to distribute a lot of files, where the standard checks of a cp command or a little files-app trickery might get in the way. (Copying files on MacOS with the limited data Apple gives you about their progress? The worst.)
It can be used for local transfers and remote backups alike, and it is considered a great low-overhead option when you want a close replication of a folder.
Like many command-line tools, it’s deceptively simple. Sure, it can copy files, but its real power lies in its fairly deep syntax and its ability to be used in endless ways. The internet is full of blog posts about people trying to optimize rsync for their use cases.
But its age, ubiquity, and elder-statesman status means it is not perfect—far from it. And that has created opportunities for competing tools built for more sophisticated use cases to replace it. The company Resilio, whose product Resilio Sync is clearly named to evoke the command-line tool, even though it uses the competing BitTorrent protocol, lays down the truth like this: “Rsync is a Linux-based tool that was invented back in the 1990’s, when file sizes and systems were small. As such, it’s an aging technology that doesn’t perform well in larger, modern replication environments.”
(In its defense, it’s only five years older than BitTorrent, but BitTorrent was built around a different approach to file transfers and with a lot less legacy.)
Recently, a series of vulnerabilities affecting the software was exposed, with one of them getting a 9.8 out of 10 on the severity scale. Essentially, nearly every Linux distro and many software projects are likely affected by the issues, which can enable remote code execution or file leaks.
I wasn’t trying to use it remotely, to be clear. Rather, I just wanted to copy a home folder, including the dotfiles, from one drive to another. The operating system is immutable, so I didn’t need to touch the innards of my Bazzite install. (I think I had to copy the WiFi logins over, but that was it.)
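I won’t pretend this is the exact command I ran, but a home-folder copy along those lines looks something like this (the mount point and username are placeholders):

```bash
# Pull an entire home directory, dotfiles included, from the old drive's
# mount point to the new one. -a preserves permissions, timestamps, and
# symlinks; the trailing slash on the source copies its contents rather
# than nesting another directory inside the destination.
rsync -aHAX --info=progress2 /run/media/username/old-drive/home/username/ /home/username/
```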
The first time I did it—hopping into the command-line TTY interface to avoid file conflicts—it came up with some errors. I fell asleep before it could complete. But the second time, it worked just fine, even if I had to work on a backup machine for a couple of hours.
Honestly, I found the endless logs of file changes inspiring, in their own way.
--
Find this one an interesting read? Share it with a pal! And back with another one in a couple of days.