In this piece, we’re focusing on archiving and we’re using a Mac. We will be discussing:
- File and media naming schemes
- The use of a cataloguing app
- File types to archive
- Simplifying the archiving process
- Local or the cloud?
Archiving images, video footage and audio is harder than archiving documents and email. Why? Because you can search through documents and email using your OS’s built-in search feature, but you can’t search through images, video and audio without first adding metadata. Whereas for text-based files you can theoretically manage everything from the Finder, you’ll definitely need a cataloguing system to avoid a mess when it comes to audio, video and images.
A cataloguing system (CS) can be a spreadsheet or a dedicated app such as NeoFinder, Portfolio, KeyFlow, Photo Mechanic, DiskCatalogMaker or DEVONthink Pro. You can also use several of them side by side, one for every file type.
A CS makes searching through archives easy and efficient. It also facilitates searching data starting from the data carrier, the media. An example: let’s say you have collected close to 600 optical discs ranging from 4.7GB to 100GB per disc. You know from experience you can still read even the oldest media (if those are archival-quality discs or M-DISCs, you will) with an external OWC Blu-Ray reader/writer, but you cannot make out what’s on them from reading the label.
As many discs contain more than one data type, you can’t tell which files are on each disc. You can, of course, name your media with a description of what’s on it, but if you collect files of different content types on the same media to save space and money, the naming scheme can quickly become very large. So, what is good practice for naming files and media?
Implementing a file naming protocol is essential for successful discovery and retrieval of objects in a collection of data. Consistency is paramount in that sense and it’s easy to implement: you settle on a naming scheme and apply it rigorously over your entire career as a creator of digital files, whether text, images, video clips or audio.
File names should be as short as possible. True, macOS allows you to use elaborate file names, but that doesn’t mean it’s efficient. That’s why you’ll see some advice below mentioning acronyms. Names can have mixed case, but some people advise converting all characters to lower case or upper case.
Here is some information that you should consider including in your file names:
- Creation date — despite macOS supporting searches for creation dates, starting with dates allows you to easily sort files and virtually group them by that date quickly.
- Project name or work title — use an acronym if you can or at the very least avoid redundancy.
- Publication or output channel.
- Creator’s initials — e.g. if you’re managing a group effort like an editor of a magazine.
- Date range that applies to the project (optional; depends on your workflow and practice).
- Type of data — are these text-only, text with images, ProRes video? A commercial, tutorial, review, commentary…?
Here is how your name is built up using all the elements from above:
- YYYY-MM-DD or YYYYMMDD. It’s best to start with the year as that is the longest active component of your date/time element. You can add a time in HHMM format if you wish; the level of granularity depends on your work. You can add an underscore or a dot as a separator, or just continue adding elements without one.
- Imagine a title that reads like this: “Archiving images, video footage and audio on a Mac — best practices”. Shorten it to “archiv.img.av.mac.best.pract”.
- A freelance writer might use a self-invented acronym for each of his publications. A freelance documentary maker can use the acronym for the TV-station that commissioned the work. If it’s a project created for a company, shorten it to its official abbreviation or invent one that you will remember years from now.
- E.g. “John Charles O’Rourke” could become “JCOR”.
- MMYY-MMYY. The first date is the start date, the second the end date.
- Use abbreviations of the different types.
- Obviously, only when you need it.
A good idea is to create a ReadMe file that explains your naming convention along with abbreviations you use and add it to the archival media holding the actual files.
An example file name consisting of 62 characters, not counting the extension: 20200524archiv.img.av.mac.best.pract.RSN.EV.PRORES422HQ.v3.tut.mp4
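To make the scheme less error-prone, the assembly can be scripted. The following is only a minimal sketch: the function name and its parameters are my own invention, and I am assuming that RSN and EV in the example stand for the publication and the creator’s initials, and that elements are joined with dots as in the example above.

```python
from datetime import date


def build_file_name(created, title_abbrev, channel="", initials="",
                    data_type="", extras=(), extension=""):
    """Join the naming elements with dots, starting with a YYYYMMDD date."""
    parts = [title_abbrev, channel, initials, data_type, *extras]
    # Skip elements that were left empty so no double dots appear.
    stem = created.strftime("%Y%m%d") + ".".join(p for p in parts if p)
    return f"{stem}.{extension}" if extension else stem


# Rebuild the example name from its elements (the roles of RSN and EV are assumed).
name = build_file_name(
    date(2020, 5, 24),
    "archiv.img.av.mac.best.pract",
    channel="RSN",
    initials="EV",
    data_type="PRORES422HQ",
    extras=("v3", "tut"),
    extension="mp4",
)
print(name)
```

Because empty elements are simply skipped, the same helper works whether you use all the elements or only the date and the title.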
Naming media is less clear-cut than file naming. As I said before, media can hold multiple categories of data. For example, you can archive texts on various subjects and images on a single medium. In such cases, you cannot apply the naming scheme for files one-to-one.
We can make good use of library sciences here by implementing part of the Universal Decimal Classification or the Library of Congress categories to name and label our media.
A library classification is a system of knowledge organisation by which library resources are arranged and ordered systematically. Library classification systems group related materials together in a hierarchical tree structure.
For example, let’s assume we are archiving a folder with invoices and a bunch of — not necessarily related — documentary videoclips. We can either use the names of the classification entries or look up the UDC codes for these broadly defined topics on the Summary webpage here.
Let’s say we will be using the codes instead of the names. We will end up with codes like these:
- 657 or 347.7 for Accountancy.
- 77.03 for Documentary within the Photography category.
The first topic, accountancy, has two possible UDC entries that either relate to accountancy as an act of commercial trade or a legal activity. Unless you’re a perfectionist, it doesn’t really matter which of these codes is 100% accurate. If it describes the topic on the media well enough, it’s OK.
The most important thing, however, is that you choose one option and stick to it. Also, it’s a good idea to keep the list of possible codes as short as possible. Write it down somewhere, so you can consult it.
The media with the files in the example can now be named 657+77.03, together with the archival date. This results in the media being named 20200524_657+77.03.
Why would you use the codes instead of the names? The answer is that the names can become quite lengthy and therefore impractical for entering into the search field of a CS. The name above would be 20200524_accountancy+documentary. That’s still manageable, but add a third category and it will become quite long.
An alternative is to name the media whatever you prefer and add this to a metadata field in your CS. Extensive descriptive metadata about a digital object helps to minimise the risks of it becoming inaccessible, but the disadvantage is that if you switch CSs at some point, the metadata field may not exist or exist in a different form, which risks you losing it altogether.
The ultimate purpose of the naming scheme is, however, to find the topics of the files on the media. The advantage of using UDC is that it is standardised and a Controlled Vocabulary (in this case, in coded format), and that names can be short and still contain a well-defined topic range. Alternatively, freelance workers and small offices can do what large corporations often do, which is to create their own Controlled Vocabulary.
Finally, to complete the media naming scheme, add a date range if needed. And what if you have media that contain the exact same topics filed on the same date? Well, in that case, you can just add a serial number. Our example then becomes 20200524_657+77.03#2.
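The media naming steps above can be sketched in a few lines. This is an illustration rather than a fixed recipe: the function name is hypothetical, and the placement and MMYY-MMYY format of the optional date range are my assumptions, since the examples above only show the date, the codes and the serial number.

```python
from datetime import date


def build_media_name(archived, udc_codes, date_range="", serial=1):
    """Return a media name such as 20200524_657+77.03#2."""
    if not udc_codes:
        raise ValueError("at least one topic code is needed")
    name = archived.strftime("%Y%m%d") + "_" + "+".join(udc_codes)
    if date_range:
        # Assumed placement and format: _MMYY-MMYY after the codes.
        name += "_" + date_range
    if serial > 1:
        # Only later media with identical topics and date get a serial.
        name += f"#{serial}"
    return name


print(build_media_name(date(2020, 5, 24), ["657", "77.03"]))
print(build_media_name(date(2020, 5, 24), ["657", "77.03"], serial=2))
```

The two calls print 20200524_657+77.03 and 20200524_657+77.03#2, matching the examples in the text.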
In large, usually robotised archival systems, the management software has a front-end that allows users to search for files by several criteria, including extensive metadata collections added to a sidecar file that is associated with the file.
On a Mac, macOS supports extended metadata that is visible and usable by the user. On all macOS versions that have the Spotlight search engine, an extended metadata set is available, but you need an app such as HoudahSpot to discover it. Using HoudahSpot on archival media, however, is not an option, as Spotlight first needs to index your files, which can take ages if your archive is stored on optical media.
A better way to access metadata is to use a dedicated management system. Some of these are all-round tools while some are more focused on visuals like images, graphics and audiovisual data. Most of them are designed to manage large numbers of files, often capable of low-res previews of documents/images/footage on offline media such as DVDs and Blu-Rays and server-like storage like cloud servers and NAS or SAN systems.
A good CS should offer extensive metadata creation and editing functionality and should be usable as a central hub to all media and files you’ve ever created or stored, including both online and offline files. Such systems create pointers to files and low-resolution representations of visual content across different storage media.
On macOS, there are a few CSs available and which one you use is entirely up to you. My advice, though, is to use a CS dedicated to a specific file type as much as possible because of the specialised features it will offer. For example, Capture One Pro offers a complete cataloguing system with ratings for images. You can use Capture One Pro (review) for online as well as offline images. KeyFlow 2 does the same for videoclips and is particularly well-suited for Final Cut Pro X users.
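To show what such a pointer index boils down to, here is a bare-bones sketch of the spreadsheet variant of a CS: it walks a mounted disc and appends one row per file to a CSV file that you can search later, even when the disc is back on the shelf. The function names and the column layout are my own assumptions; a real CS adds previews and far richer metadata.

```python
import csv
import os
from datetime import datetime, timezone


def catalogue_volume(mount_point, media_name, csv_path):
    """Append one row per file on the mounted media to a CSV catalogue."""
    is_new = not os.path.exists(csv_path)
    rows = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["media", "path", "bytes", "modified_utc"])
        for folder, _subdirs, files in os.walk(mount_point):
            for file_name in sorted(files):
                full_path = os.path.join(folder, file_name)
                info = os.stat(full_path)
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([
                    media_name,
                    os.path.relpath(full_path, mount_point),
                    info.st_size,
                    modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
                ])
                rows += 1
    return rows


def find_in_catalogue(csv_path, term):
    """Return (media, path) pairs whose path contains the search term."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [(row["media"], row["path"])
                for row in csv.DictReader(handle)
                if term.lower() in row["path"].lower()]
```

On a Mac, the mount point of an inserted disc is a folder under /Volumes, so a call could look like catalogue_volume("/Volumes/MyDisc", "20200524_657+77.03", "catalogue.csv"), where MyDisc and catalogue.csv are placeholders.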
A general purpose CS is NeoFinder (review), which is superior to DiskCatalogMaker because of its more extensive support for metadata. It lacks functionality in that it will not, for example, allow you to select metadata fields that are invisible in the Finder, nor will it allow you to create your own.
A good solution is to use DEVONthink Pro (review), the freeform database that has many features for managing large collections of data and media. DEVONthink has a feature that indexes files no matter where they are located. In that respect it functions much like a CS. It also supports the creation of metadata.
None of these solutions is ideal, certainly not from the perspective of building a long-term archive. The best solution is to use a Digital Asset Management system, but those are usually cloud or server-based, aimed at brand management on an enterprise scale and/or quite expensive.
At the very least, you will not be able to read all your current file types ten years from now. That is not always due to the files not surviving. Sometimes you may have moved on to using different software and the app that you used to create them doesn’t support them.
For archiving, we need to be sure we will be able to read files long after they’ve been created. We can’t say much about the future but we can look at the past and then extrapolate towards 20, 30 years from now. I’m at an age where I can look back on an active life of 25+ years, so I have some experience with what I can still read today that existed back then and what I can’t.
Adobe’s Digital Negative (DNG) is still very much alive and kicking. So are JPEGs and TIFFs. That longevity is nowhere to be found in camera RAW file formats. To give you an example, some 15 years ago I had a Kodak mirrorless “DSLR” that wasn’t on the market for very long. It output images in a RAW format not a single image editor can read today. I was lucky enough (not clever enough, mind you) to have saved those images as DNGs.
If you think that only happens with low-end formats like that of the Kodak camera, think again. My Hasselblad files from a review unit dating back some 20 years are no longer supported by most image editors. Capture One Pro is an exception and, I hope for its users, the Hasselblad editing app another.
In short, store your images in RAW to retain the highest possible quality in case they’ll still be readable in three decades or more, but make sure they are also converted to DNG so you’re not going to depend on software developers’ whims to include your old camera RAW format.
Vector graphics are still around; even the Adobe Illustrator format from 30 years ago is still readable. If you want to make absolutely sure you’ll be able to read your vector graphics 20 years from now, save them in PDF format.
Text-only, markdown and pure HTML
No problem here as text-only is always going to be readable, provided the file extension isn’t some exotic invention of one developer. Even then, you will be able to open the file using a text editor like BBEdit and extract the content.
Markdown is a text file with some additional, simple coding inside. Files in this format can be read as text-only files. There’s one caveat. I’m writing this with Ulysses (review), my favourite markdown/text editor. Ulysses used to only support its full feature set when using its sheets container that lives deep inside the current macOS’s User Library. If you archive Ulysses sheets, you will either have to archive the sheets inside that system library or export all of them to text files or PDFs.
Text from Microsoft Word, Apple Pages and your favourite layout app
Microsoft Word’s files are really XML containers with your text buried inside. In theory, you will always be able to open the container and pick the text from the file within. In practice, you can’t tell if you’re going to be able to open that container file in the future. Apple Pages files more or less have the same internal structure. Both developers’ files can only be read by the originating app.
That also goes for Adobe InDesign, Affinity Publisher and all other apps that add some kind of formatting to the text and page that is embedded in the file. In general, every application that has to print a file with a specific look and design in terms of page layout — and that includes Word and Pages — will hardcode the design and the text in the same file or bundle.
The solution for archiving is PDF, specifically PDF/A, which is the variant standardised for long-term archiving. PDF has been used since I started writing some 27 years ago and it has only become more of a standard (ISO and all) over time. PDF is a very sure bet for archival purposes.
From Adobe’s website: PDF/X, PDF/E, and PDF/A standards are defined by the International Organisation for Standardisation (ISO). PDF/X standards apply to graphic content exchange; PDF/E standards apply to the interactive exchange of engineering documents; PDF/A standards apply to long-term archiving of electronic documents. During PDF conversion, the file that is being processed is checked against the specified standard. If the PDF does not meet the selected ISO standard, you are prompted to either cancel the conversion or create a non-compliant file.
The most widely used standards for a print publishing workflow are several PDF/X formats: PDF/X‑1a, PDF/X‑3, and (in 2008) PDF/X‑4. The most widely used standards for PDF archiving are PDF/A‑1a and PDF/A‑1b (for less stringent requirements). Currently, the only version of PDF/E is PDF/E-1.
Sound and audio should definitely not be archived as MP3 files. Try formats such as WAV, AIFF and FLAC instead. AAC is widely supported too, but like MP3 it is a lossy format. Uncompressed and lossless formats generally take up more space than lossy ones, but they are the only way to store audio with the sound quality at the time of recording. There’s also a very good chance that you will be able to read them 30 years from now, as FLAC is an open format and AAC is an ISO standard.
The surefire way to archive sound is to record it to separate media and create a repeating reminder every five years or so to check if the format is still alive.
The problem with audio is more or less also the problem with video. Ideally, you shouldn’t store video in compressed formats. Video archival is a complex subject and for long-term preservation it’s a good idea to see what museums and libraries are doing. A web page dedicated to this subject can be found here.
Just as with audio, refreshing the data at regular intervals and storing multiple copies on different media types (in different geographical locations if possible) is best.
Archival codecs are preferably open source — if you want to check which codecs and audio formats are open source, look here — and possibly some well-maintained proprietary ones such as ProRes, DNxHD and Cineform.
Email is a problem by itself. The message paradigm in digital form today may not be the one of tomorrow. To make sure you can read messages and replies from today in the future, you should archive messages to a text-only format on a regular basis.
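As an illustration of what archiving to text-only can look like, here is a minimal sketch that uses Python’s standard mailbox and email modules. It assumes the messages have first been exported as a standard mbox file; the function name and the output file naming are hypothetical. Messages that only carry an HTML body end up with headers but no text in this simple version.

```python
import mailbox
import os
import re
from email import policy
from email.parser import BytesParser


def export_mbox_to_text(mbox_path, out_dir):
    """Write every message in an mbox file to its own UTF-8 text file."""
    os.makedirs(out_dir, exist_ok=True)
    parser = BytesParser(policy=policy.default)
    box = mailbox.mbox(mbox_path)
    written = []
    for index, raw in enumerate(box, start=1):
        # Re-parse with the modern policy to get decoded headers and bodies.
        message = parser.parsebytes(raw.as_bytes())
        body = message.get_body(preferencelist=("plain",))
        text = body.get_content() if body is not None else ""
        subject = str(message["subject"] or "no-subject")
        # Keep only letters and digits so the file name is safe everywhere.
        safe = re.sub(r"[^A-Za-z0-9]+", "-", subject).strip("-")[:40]
        target = os.path.join(out_dir, f"{index:05d}_{safe or 'no-subject'}.txt")
        with open(target, "w", encoding="utf-8") as handle:
            for header in ("date", "from", "to", "subject"):
                handle.write(f"{header.title()}: {message[header] or ''}\n")
            handle.write("\n" + text)
        written.append(target)
    box.close()
    return written
```

Attachments are ignored here on purpose; they are better archived as files in their own right, following the advice per file type above.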
On the Mac, I have always found EagleFiler a good solution; better even than DEVONthink Pro. DEVONthink can archive email messages but takes a long time to go through all of them if your mailbox has over 5000 messages.
EagleFiler takes much less time initially and indexes the messages at a later stage. It saves messages to a bundle that you can navigate internally to find each message as an XML envelope with its content.
Dedicated mail archival apps exist and they make more sense, provided the apps will be maintained in the future. Two apps are well-known in this market: MailSteward and Mail Archiver X. I once had the pleasure of reviewing them and the experience was a mixed bag.
In many large companies, archival follows strict rules governed by ISO regulations. Formalising the archival steps into a written guide simplifies the process because that enables users — or you — to start naming files in your daily practice with an eye on the archival process, regularly save text-based files to PDF, maintain the database, and ultimately offload to archival media.
Stick to one file naming system throughout and use the same one forever. For example, as a writer I will stick to this scheme:
YYYYMMDD_title.of.work.publicationabbreviation.review(or tutorial, backgrounder, news, whatever).numberofwords
- Date of archiving_
- Working title of my article
- Abbreviation of the first publication running the article
- Type of article, e.g. a review, tutorial, tips & tricks, background article, news item…
- Number of words.
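A scheme this regular can also be read back by a script, which is handy when you want to load old file names into a CS. The sketch below is only an illustration: the function name and the sample name are made up (RSN is borrowed from the earlier example and the word count is invented), and it expects the name without a file extension.

```python
from datetime import datetime


def parse_article_name(name):
    """Split YYYYMMDD_title.pub.type.words into its named elements."""
    stamp, _, rest = name.partition("_")
    # strptime raises ValueError on an impossible date such as 20201345.
    archived = datetime.strptime(stamp, "%Y%m%d").date()
    # The title may itself contain dots, so split the fixed fields off the right.
    title, publication, kind, words = rest.rsplit(".", 3)
    return {
        "archived": archived,
        "title": title,
        "publication": publication,
        "type": kind,
        "words": int(words),
    }


# Hypothetical file name that follows the scheme above.
info = parse_article_name("20200524_archiv.img.av.mac.RSN.tutorial.2400")
print(info["title"], info["publication"], info["type"], info["words"])
```

Splitting from the right is what makes dotted titles safe: the last three elements always have a fixed position, whatever the title looks like.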
For archival media, use the scheme in chapter 3. For example, the name of a DVD containing video, audio and the related invoices from 2000 to 2004 is built up like this:
- Date of archiving_
- UDC codes of topics, concatenated
- _date range (only if needed)
- Serial number, if more than one media holds the same topics archived on the same date.
You can make archiving easier if you add barcodes to your media labelling process besides the textual description you’re bound to use. Barcodes are symbols that represent numeric, alphanumeric or mixed data in a graphic format. For example, a barcode on the back of a book is usually the graphic representation of the ISBN number that identifies the book.
As media names can’t be 100% descriptive and intelligible to everyone in an organisation, barcodes offer direct, quick and easy access to the cataloguing app that holds their file index. You can even program some barcode scanners like the Honeywell Voyager XP 1472g (review) to launch the app or bring it to the front and automatically input the barcode character string in the search field.
You can convert media names or metadata to barcodes by entering the alphanumeric/numeric data in a dedicated app like Barcode Producer or Barcode Studio (review), or you can use Tec-It’s online barcode generator.
The benefits of using barcodes are numerous:
- If you are a member of a group, barcodes avoid errors due to other people not being able to read your handwritten label.
- Combined with a simple check-in/check-out system, barcodes allow for track & trace of media.
- A media label with a barcode isn’t readable without a scanner. If you use an exotic barcode symbology, it is a first, albeit weak, defence against prying eyes.
- It saves time when you need to quickly know what’s on the media. You just scan it with the cursor in the search field of your CS and the media is returned as a result.
- Barcode scanners don’t know typos.
Barcodes have drawbacks too, of course. For starters, you need to create the barcode and buy a scanner. With Tec-It’s online barcode generator or its wonderful Barcode Studio app (85 Euros), you can create alphanumeric 2D barcodes almost no scanner can read (think Dotcode and MaxiCode), which adds another modest privacy protection layer. These symbologies also support a lot of characters, so you can be as descriptive as you like.
A good barcode scanner like the Honeywell Voyager XP 1472g 2D barcode scanner will cost you around 100 Euros, but in return you’ll get a scanner that scans every known barcode, even when it’s not perfectly readable or damaged, that knows how to scan in batch, that you can program in different ways and that is very robust. The Voyager XP 1472g scans Dotcode and MaxiCode symbologies.
Cataloguing System or Database
On the Mac, DEVONthink Pro is the best general-purpose CS available in my opinion. It supports custom metadata — you can create your own, including barcode fields, categories, all kinds of references and lists, etc.
In addition, DEVONthink Pro allows you to index media the same way as DiskCatalogMaker or NeoFinder and adds the ability to link and interconnect your media and the files on them with other data in its databases. It indexes rather fast and supports videoclips, images, text and a whole slew of other formats retaining the look and feel of many of them in the indexed previews.
And finally, it has a hugely powerful search feature that offers unparalleled capabilities to find data across different media and file formats.
There are many types of computer users, but in general we tend to differentiate between individual users and enterprise users. The former usually prefer the comfort of an external hard disk drive or consumer-oriented cloud service to archive data.
Hard disk drives are indeed faster than tape and optical media but they are only reliable for a short while and data can suffer from magnetic degradation. Personal experience has taught me the hard way that disk drives are only good for backing up, not for archiving. I had two disks that I kept in a box stored at over two meters from the nearest magnetic field and they both failed. In my case, it wasn’t magnetic degradation that killed them; it was mechanical malfunction, perhaps due to lying idle for over three years.
I should have refreshed the data every year or so, but data refreshing apps aren’t available for the Mac, as far as I know.
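Refreshing comes down to copying the data to fresh media and confirming that the copy still reads back unchanged. The confirming half is easy to script. What follows is a minimal sketch and not something I used at the time: it relies on SHA-256 checksums, which is my own choice, and all function names are hypothetical.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file's relative path under root to its digest."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for file_name in files:
            full_path = os.path.join(folder, file_name)
            manifest[os.path.relpath(full_path, root)] = hash_file(full_path)
    return manifest


def verify(root, manifest):
    """Return the relative paths that are missing, unreadable or changed."""
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        try:
            intact = hash_file(os.path.join(root, rel_path)) == expected
        except OSError:
            intact = False  # missing or unreadable file
        if not intact:
            problems.append(rel_path)
    return problems
```

The idea is to build the manifest once, when the data is written, and to run verify against the disk at every refresh; an empty list means every file still matches. The manifest is a plain dictionary, so it can be saved next to your catalogue, for instance as a JSON file.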
If hard disk drives are a no-go, then perhaps the SSD is a good idea? I think not. Sure, they are robust and virtually shock-proof, but the type of memory they’re based on won’t hold data for extended periods of time. The cells leak over time. SSDs are also too new to tell how long exactly an SSD will retain data when stored unpowered.
Magnetic tape might be enticing as it is still used by large companies but it has its drawbacks. It can stretch and break, and be magnetically erased. And it’s expensive. The tape drives themselves are not robust and as data is stored sequentially, retrieving is excruciatingly slow.
Optical media, then? Well, they’re my favourite storage media and my experience with high-quality DVDs and Blu-Ray discs has been very positive. I keep my media in the dark (literally), protected from heat — although it does regularly get warm at over 30 degrees Celsius in my storage room during the summer — and stored upright on their edge.
I have DVD-Rs dating back to when I started writing. All of those are Verbatim’s best-rated Archival Discs. For the past three years, I have archived onto Verbatim’s M-DISC Blu-ray discs. Those are rated for 1,000 years, but even my old DVDs are still in perfect condition and I can read them as if they were written yesterday. M-DISCs are currently the hardiest discs available and they come in 25GB, 50GB, and 100GB Blu-Ray versions.
The downside to archiving on consumer-level optical media is twofold. For starters, it is a relatively slow process. A 100GB M-DISC takes about one and a half hours before writing is complete and then another 15 minutes to verify. The second disadvantage is that you’ll need an optical drive capable of writing to an M-DISC, which currently means you’re stuck with an LG BH16NS40 or compatible model. The highest capacity M-DISCs are also quite expensive.
For video and film producers, the best alternative to any of the above and to cloud services is the Sony Optical Disc Archive (ODA), which is a cartridge with 11 Blu-Ray discs inside, optimised for quick, random read/write performance combined with high capacities of 5.5TB. The Total Cost of Ownership (TCO) of a Sony ODA is lower than that of a cloud solution. It can easily pay for itself within two years for a PetaSite library (that holds petabytes) and in an even shorter time for an individual disc reader/writer that costs in the region of €900, with a cartridge costing something like €185. The payback period depends on the amount of storage: the more you store, the faster the investment pays for itself.
Sony ODA is used by large companies like Endemol Shine and Golf Channel, and even by freelance cameramen who archive their footage on ODA cartridges because they are robust, have a long lifespan of 100 years and offer fast retrieval, with a typical access time of under 2 seconds. Sony also delivers a specialised DAM app, Content Manager, for the ODA drive that allows users to manage their files.
What about the cloud? The cloud is a no-go for long-term archiving. Sure, it’s easy, convenient and there are some very cheap online storage services but the drawbacks are considerable. Your data ends up on hard drives and you will have to pay a monthly fee and, in some cases, transfer charges. Unless you can negotiate a good Quality of Service (QoS) agreement for a low price, cloud archiving is a costly affair. Low-cost, high QoS level combinations do not exist — you always get what you pay for.
Then there’s the ISP who is essentially your man in the middle and Single Point of Failure. In essence, you’re at the mercy of your ISP’s online connection speed and availability. Privacy and security concerns are also an issue.