
Over the history of mankind, the best way found to archive data was to carve it into stone and then bury it in the sand. Photographically, the most stable form of archiving is probably a black-and-white silver-based image on a glass plate. For digital data, there is no perfect permanent storage option: most digital storage media can’t be counted on to remain dependable beyond 5-10 years.
So clearly, our standard backup procedure (two backups) on hard disks, solid-state disks, optical discs, or magnetic tape is not the whole job. It’s not enough to make our backups and just store the copies in a box. Many experts recommend checking our archived media once a year, since any kind of commercially available (and affordable) storage gradually degrades over time whether or not it is in use.
Even right after our backup copy is made, can we assume it’s valid? The date, time, and size of the file copy may match the original, but do we know that the data in the files on the backups are any good? The worst time to find out that the backups are no good is when you need to retrieve or restore some files!
If you are thinking that looking at every image or video in every backup you have made once a year seems impractical, you’re right! Using a program to compare two copies of a file byte by byte is much better. Using a smarter program to compare all the files in a backup with the original or second copy is better still, but think back to how long it takes just to copy terabytes of files from one disk to another — probably hours even on a fast system, and perhaps more than a day over a slow network. And remember that comparing copies of files takes about the same amount of time, since the computer must read two large files to check one against the other.
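As a quick illustration (the file and directory names here are only placeholders, and the backup is assumed to be mounted at /Backup), the standard cmp and diff utilities on a Linux system can do exactly this kind of comparison, with both copies being read in full:
- $ cmp 2020/IMG_0001.NEF /Backup/2020/IMG_0001.NEF # compare one pair of files byte by byte
- $ diff -qr 2020 /Backup/2020 # report any files that differ anywhere in the two directory trees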
Cutting the Effort in Half With a Digest
An alternative that cuts the checking process in half is to generate a message digest, also known as a cryptographic hash in computer terminology. What this does is take an entire file and run it through a special scrambling procedure to output a short digest, typically 16 to 64 bytes (128 to 512 bits). The scrambling is not random. The algorithm is specified exactly, so the same digest will always be calculated for the same file. For the purposes of validating backups, the shortest standard digest is the 128-bit (16-byte) MD5 (Message-Digest Algorithm 5), normally written as 32 hexadecimal characters. The scrambling algorithm is designed in such a way that changing a single bit in the input drastically changes the digest.
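Jumping ahead slightly to the md5sum program used later in this article, you can see the effect for yourself by digesting two short strings that differ by a single character (the strings themselves are arbitrary); the two digests will look completely unrelated:
- $ printf 'archive 2020' | md5sum # digest of a short sample string
- $ printf 'archive 2021' | md5sum # one character changed: the digest is completely different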
You’ll often see the MD5 (or longer SHA) digests alongside files for download over the internet so that your copy of the downloaded file can be verified to be exactly the same as the copy on the server providing the file. The procedure is for you to download the file and generate the digest on your downloaded copy, which can then be compared to the digest published on the server. If they match, you can rest assured that you’ve successfully downloaded a clean copy without corruption by the communication channel.
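As a sketch of that procedure (the file names here are hypothetical), the digest can be computed and compared by hand, or handed to md5sum if the site publishes a digest file in the standard md5sum format:
- $ md5sum downloaded-installer.iso # print the digest and compare it by eye to the published value
- $ md5sum -c downloaded-installer.iso.md5 # or let md5sum do the comparison against a published digest file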
It should be obvious that squeezing an arbitrary file down to a 128-bit digest cannot produce a digest that is unique to a single input file. But with 2^128 possible digest values, the chance of two files (or an original and a corrupted version of a file) having the same digest is small enough to ignore, so we can check the validity of a copy by generating its digest and comparing it with the digest of the original file. The practical significance of this is that you don’t need the original file in order to verify that the copy is accurate, and you save the time that would have been required to read the original too.
Processing Backups
As noted earlier, there are longer digests that have been standardized, but the computation of these digests takes more CPU time, so the MD5 digest is better for use in verifying large collections of files. The general procedure for implementing this is:
- Start at the top-level directory of your backup set (e.g., a directory named 2020 for a set of photos).
- Generate the MD5 digest for all files in directory 2020 and subdirectories, saving the list in a text file such as md5.txt.
- Copy the directory (and subdirectories) along with the md5.txt file to the backup drive.
Once this is done, the backup disk copy can be verified by running an MD5 verification by reading the backup files and comparing the MD5 digests to what is contained in the md5.txt file. If they all match, you can be reasonably sure that the copy has been created without corruption. Note that this works for all files, regardless of their contents, so you can include your word-processed notes or spreadsheets that go along with the photos or videos you are archiving.
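For reference, each line of md5.txt simply pairs a digest with a file path, so the whole catalogue is human-readable; the digests and file names below are made up purely for illustration:
- 9e107d9d372bb6826bd81d3542a419d6  ./Jan/IMG_0001.NEF
- e4d909c290d0fb1ca068ffaddf22cbd0  ./notes/2020-captions.odt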
Software
While many programs exist to generate the MD5 digest of a file, most of these work only on a single file. It turns out that the best (and fastest) way to batch process all the files in a backup is to use a set of relatively simple programs that already exist in a Linux/Unix terminal environment. While this may sound primitive, it has the advantage of running on a variety of operating systems, since many platforms also support Linux: Windows has its WSL (Windows Subsystem for Linux), higher-end x86 Chromebooks support installing Linux, and Apple systems already have a Unix variant underneath their GUI. Note: I have not personally tested this on an Apple computer.
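One additional caution for Apple users, equally untested here: macOS ships the BSD md5 utility rather than GNU md5sum, so the md5sum commands below may require installing the GNU coreutils package, for example with Homebrew:
- $ brew install coreutils # provides the GNU tools; md5sum may appear under the name gmd5sum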
Once a Linux terminal window is available, generate the MD5 digest on the original files (an example is the 2020 directory):
- $ cd 2020 # make the top-level directory the current directory
- $ find . -type f -exec md5sum "{}" + | tee md5.txt # generate a digest for every file under the current directory
For convenience, put the command line into a shell script so you don’t have to type it each time; a sample script is sketched below. (If this is confusing, get your local computer guru to set you up.)
Note that “$” indicates the terminal prompt. The programs find, md5sum, and tee are installed by default on Linux systems. The find command searches for all files at and below the current directory and passes the list to md5sum, the heart of the procedure, which generates the MD5 digest for each file. The digests are collected in an output file in the current directory, called md5.txt here as a default, though the name can be changed to whatever you like. The tee is included so that each digest line is shown on your screen as it is produced, as well as written to md5.txt, giving you some feedback that your computer is actually working on the task.
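As a concrete sketch of that shell script (the name make-md5.sh is just an example, and the exclusion of md5.txt from the list is an optional addition so it will not show up as a false failure during verification), the file could contain:
- #!/bin/bash
- # make-md5.sh: catalogue every file under the current directory
- # run from the top-level directory of the backup set, e.g. 2020
- find . -type f ! -name md5.txt -exec md5sum "{}" + | tee md5.txt
Make it executable with chmod +x make-md5.sh and run it as ./make-md5.sh from inside the directory being catalogued.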
Once the file tree has been copied to a backup destination (say, /Backup), just go into the backup directory and verify the copied files:
- $ cd /Backup/2020 # make the backup directory the current directory
- $ md5sum -c md5.txt | tee md5.log
This will cause md5sum to read each file listed in md5.txt, generate its digest, compare it to the digest recorded in the file, and output “OK” or “FAILED.” These results appear on the screen as they are processed and are also copied to the text file md5.log. That file can be opened in any text editor and searched for the string “FAILED” to see whether any of the files failed verification. Note that if md5.txt itself was included in the digest list (that is, it was not excluded as in the script sketch above), it will appear to have failed, since it was still being written while its own digest was being generated.
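A couple of ordinary grep commands (using the md5.log name from the command above) make that scan quick:
- $ grep -c ': OK' md5.log # count the files that verified successfully
- $ grep FAILED md5.log # list any files that did not verify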
Put Your Old Computer(s) to Work
Both the initial generation of the MD5 digest list and verification on a copy can take hours if you have a large number of files. But the good news is that this doesn’t have to tie up your main computer. Backup disks can be taken to another computer system and verified without needing access to the master set of files. Put the second computer (and other idle computers) to work while you go on with normal work on your main computer. These other computers also don’t have to be running the same operating system as long as they have a Linux terminal capability, since the md5sum program generates the same MD5 digest on all systems.
If we follow the recommendations of archival experts, we should do this every year on all of our backup copies to allow us to sleep better at night. The annual pass also reads the contents of every file on the disks, which gives the firmware controlling the hard disk or SSD a regular chance to verify that all is well with the underlying hardware and recording medium. If errors are found during reading, the firmware will attempt to relocate the data out of failing areas of the media or, in the worst case, alert us to the failure of the drive.
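As a final sketch of what the yearly pass might look like (the /Backup mount point and the one-directory-per-set layout are assumptions carried over from the earlier example), a one-line loop can verify every set on a drive in turn:
- $ for dir in /Backup/*/; do (cd "$dir" && md5sum -c md5.txt | tee md5.log); done # verify each backup set and leave a fresh md5.log in each directory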