Archival Digital Texts
May 6, 2013 9:12 AM

I have inherited my grandmother's writings. I'm scanning them, doing OCR and in some cases retyping them so that I can have them digitized. I'd like to know, is there a format for text that is best for archival purposes? I was thinking of .rtf, since so many applications can open it. Is it considered "archival quality"?
posted by dylan_k to Computers & Internet (15 answers total) 4 users marked this as a favorite
Are you using any of the features of rtf documents -- headers, formatting, what have you?

If not, I don't think anything beats plain text as a format, but you probably want to worry more about what you're storing the files on than how you're storing them. Hard drives are not necessarily going to spin up after you leave them on the shelf for a number of years, and I don't know how long DVDs are expected to last.
posted by katrielalex at 9:18 AM on May 6, 2013 [1 favorite]

Hello! Person with a Master in Archival Studies here. PDF is the go-to format for those of us who want long-term stability/readability. However, plain text will also work well if you don't need to keep the original formatting and just want the words.
posted by jezemars at 9:35 AM on May 6, 2013 [1 favorite]

Librarian here... our digital repository at work uses PDFs
posted by kbuxton at 9:42 AM on May 6, 2013 [1 favorite]

PDF yes, but specifically PDF/A, which is designed precisely for archival preservation. I believe LibreOffice Writer can output PDF/A. "Normal" PDF has various bells and whistles which make it less suited to archiving.

There's no harm in saving as RTF as well, but it's not really designed for this application. And for the real long haul, print them onto acid-free paper -- I won't advise on what kind of printer to use for maximum ink/toner life, because I don't know much about that.
posted by pont at 9:53 AM on May 6, 2013 [1 favorite]

PDF is a bona fide ISO standard these days. The scans should be saved as PDF files with embedded images, and not bare image files. For plain text, you can't beat a .txt file. Even if you want to use the formatting features of RTF, I would probably use text files with Markdown-style emphasis where necessary (basically old-fashioned email or Usenet formatting).
posted by stopgap at 9:56 AM on May 6, 2013 [1 favorite]

I was going to suggest PDF/A as well.
posted by Conrad Cornelius o'Donald o'Dell at 11:08 AM on May 6, 2013

Best answer: Hi. I manage a collection of some 100,000 PDF/As. It is a terrible, terrible format, if you ever want to actually DO something with the text at any time in the future. Because PDFs are a binary format, getting the actual text back out is a real bear. PDFs just aren't up to the task. It does make it real hard to modify the file though, which is a relatively good thing. If you do the PDF/A route, at the least save something else too.

If you need more than straight text, and it doesn't take much to make that happen, then I recommend something using XML, which is still readable by humans using nothing more than a text editor. Specifically, I would recommend the Open Document Format, the format used by OpenOffice and its spinoffs. ODF files are easy to read if you open up the XML, and they follow best practices in a lot of ways: semantic markup, and keeping formatting separate from the text itself. ODF is also an ISO standard.

If you don't need something as much as an actual word processor, markdown would be a good choice to add some formatting cues. Another would be to just use HTML, which is basically another branch of the XML lineage.

rtf is a closed-source binary file format. Though it seems ubiquitous, Microsoft could decide to kill it off at any time. In addition, every word processor handles the files in a different way, so formatting is easily lost. On decadal timescales, rtf is a poor choice. There are so many better ways of representing text now, I would not be at all surprised if rtf support is quietly deprecated in the fairly near future in most products.
posted by rockindata at 12:18 PM on May 6, 2013 [3 favorites]

Agree - PDF/A is great for storing. The ISO standard is good for future-proofing it and there is a fair amount of 3rd party support for PDF.

Getting text out of it - well, that's an arcane art, which is as easy as the production software was sane. Here's a Stack Overflow answer that I cobbled together about why writing a PDF editor is non-trivial.

Text extraction, oddly enough, is slightly easier. Slightly. This is why there are fewer tools available that index and search the documents. If you select PDF/A (and honestly, PDF/A is overkill for your task - any tool that generates reasonable PDF that's, say, version 1.4 or earlier and embeds all fonts is going to be close enough), you should ask yourself how important it will be to find things in the document(s).

Otherwise, if you don't care about formatting at all, just use plain text - something that is trivially readable, trivially searchable, and will work until someone decides to change what a newline character is again.
posted by plinth at 1:11 PM on May 6, 2013

Response by poster: Thank you for all the thoughtful answers, everyone. There were some questions about formatting, and about how I want to use the format. The documents I'm working with were written on a typewriter, so the formatting is minimal, but it is there. I have the occasional bold, underline or italicized word here or there, but they're not using headers, footers, columns or anything too fancy.

I'm hoping to avoid Markdown if possible, simply as a matter of personal preference, but perhaps I should give up on that. Or, what about HTML?

I do like about .rtf or .txt that they are searchable, etc. Once I have compiled all the texts in this format, I can imagine that I might want to do some searches across them, or to use some tools to examine them as a group of texts, etc. It doesn't seem like PDF is as easy to use, for that sort of thing.
posted by dylan_k at 3:26 PM on May 6, 2013
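
To illustrate the point about searching across plain-text files, here is a minimal Python sketch. It assumes the transcriptions are saved as UTF-8 `.txt` files in a single folder; the function name `search_texts` is made up for this example, and this is only one of many ways to do it.

```python
from pathlib import Path


def search_texts(folder, phrase):
    """Return (file name, line number, line) for each line containing phrase."""
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the search going if OCR left stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            # Case-insensitive substring match, one line at a time.
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```

Calling something like `search_texts("stories", "garden gate")` (with "stories" standing in for whatever the folder is actually called) would list every matching line along with its file and line number. Doing the same against PDFs would first require a text-extraction step.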

OBNIT: rtf is a closed-source binary file format

Closed-source/proprietary? Yes.
Binary? No
posted by Good Brain at 5:08 PM on May 6, 2013

If you were an institution digitising a collection I would ask the purpose and advise a format based on that. Access or preservation? Access = .txt and .PDF. Archival-quality standard formats would be .TIFF for quality and preservation, with an OCRed PDF for access and searchability. You can extract the OCR to .txt or .rtf and tidy it up, as it will inevitably be wonky.

For handwriting stick with transcription.
posted by BAKERSFIELD! at 11:39 PM on May 6, 2013

OBNIT: rtf is a closed-source binary file format

Closed-source/proprietary? Yes.
Binary? No

RTF is not binary in the sense that it turns into garbage in a text editor, but it might as well be in that it permits arbitrary binary to be embedded and will allow it in supposedly valid documents. I worked with a "database" of 5700+ human languages that was a giant RTF file and found three different, incompatible forms of escapes mixed in, and there weren't even Asian characters in quantity - just lots of IPA and %#$^ Microsoft punctuation.

I think XML is an evil format, but it'd honestly be fine for this (and for search it does look like butterflies and lollipops next to PDF). PDF would be OK if you're only interested in how the documents look rather than what's in them.

If they're typewritten with basically no handwritten notes, I'd just do plaintext/markdown (you can render Markdown to HTML on the fly). If you have things that won't OCR nicely or want to save facsimile quality scans, use JPEG2000 - it's good enough for the Library of Congress even if it's not that widely used yet.
posted by 23 at 2:10 AM on May 7, 2013

Response by poster: The documents I'm working with are typewritten documents: short stories, novels, and the like.

I tried an experiment with the Open Document Format, but was surprised to see that I couldn't open the file in Microsoft Word, even though (I thought) Microsoft claims to be able to open the files.

I may just suck it up and deal with Markdown, but I think that there ought to be a format that presents formatting in a formatted way, right? That's what I like about .rtf: the italics are italicized, the bold is bold, and the text is text rather than a picture of text.
posted by dylan_k at 6:55 PM on May 7, 2013

I think that there ought to be a format that presents formatting in a formatted way, right?

RTF only looks formatted because you open it with an RTF reader, so instead of this:

{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
This is some {\b bold} text.\par
}

You see:

This is some bold text.

The above line also only looks formatted because you're seeing HTML in an HTML viewer. There's not really such a thing as a "Markdown viewer", but that's because Markdown is designed to just turn into HTML - it's intended primarily for writing, not reading, though staying more readable than unrendered HTML is a perk.
posted by 23 at 7:18 PM on May 7, 2013
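
As a rough illustration of the idea that Markdown-style emphasis is just plain text waiting to be turned into HTML, here is a small Python sketch. It handles only `**bold**` and `*italic*` within a single line and is not a real Markdown implementation; the function name `emphasis_to_html` is invented for this example.

```python
import html
import re


def emphasis_to_html(line):
    """Convert **bold** and *italic* markers in one line of text to HTML."""
    # Escape &, < and > first so the original text cannot inject markup.
    escaped = html.escape(line, quote=False)
    # Bold first, so that ** is not mistaken for two italic markers.
    escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
    escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
    return escaped


print(emphasis_to_html("This is some **bold** text."))
# prints: This is some <strong>bold</strong> text.
```

The source file stays readable in any text editor, and the formatted view is produced only when needed. For anything beyond occasional bold and italics, an established Markdown converter would be the safer route.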

The idea with Markdown is that you can write it easily in a plain text editor and convert it easily to HTML and any number of other things. But fair enough if you don't like it, it's your project. In that case I'd advise HTML, which is a clean, standardized format.

The question is then how you'd write your HTML. Presumably you'd like a WYSIWYG editor, but you have to be a little careful here: if you write it in, say, Word or LibreOffice and save as HTML, you'll get a valid file but it will probably have a lot of extra hidden "cruft" that will take up extra space (not really a problem) and make things complicated if someone wants to process/convert it in the future (really a problem). I don't have a personal recommendation here, because I'd do it by writing Markdown and turning that into HTML. But this thread has some good suggestions.
posted by pont at 2:39 AM on May 8, 2013 [1 favorite]
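
For what it's worth, the "cruft" from a word processor's HTML export can be reduced after the fact. The following is a sketch only, using Python's standard `html.parser`; the list of tags to keep and the names `CruftStripper` and `strip_cruft` are assumptions chosen for this example, and real exports may need more care (headings, line breaks, nested lists).

```python
from html import escape
from html.parser import HTMLParser

KEEP = {"p", "b", "i", "u", "em", "strong"}  # hypothetical whitelist of tags
SKIP = {"style", "script", "head"}  # elements whose content is dropped


class CruftStripper(HTMLParser):
    """Keep whitelisted tags (without attributes) and the visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_cruft(source):
    parser = CruftStripper()
    parser.feed(source)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Given `<p class="MsoNormal" style="margin:0">Some <b style="x">bold</b> text.</p>`, this returns `<p>Some <b>bold</b> text.</p>`: the words and the basic emphasis survive, while classes and inline styles are dropped.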
