Anyone still speak 2-D barcodese?
September 22, 2007 9:02 AM

Help me solve the mystery of an obscure 2D barcode, indexing historically valuable documents from the Vietnam war!

[apologies in advance for the length here, this gets complicated.]

During the Vietnam war the US side started a unit to exploit information from documents captured from enemy soldiers. This unit (the Captured Document Exploitation Division, CDEC) compiled somewhere around 3 million pages of documents, estimated at containing 500K distinct "files" over the course of the war. Before the documents were turned over to the S. Vietnamese, they were captured on film, actually on 35mm movie reels, one frame per page. This vast amount of film is now in the hands of the U.S. National Archives, but there's no *useable* index for it.

Now here's the crux of the mystery: the 35mm movie reels were "indexed" using a special one-of-a-kind machine (built by a defunct California company called "FMA FileSearch"), which encoded indexing information into the *audio track* of the 35mm film, in the form of these 2D barcode-ey things.

Long story short, the machine that read this barcode was left in Vietnam, and no working replica has been located or fashioned ever since. No user manual for the machine exists either.

Now here's the crucial clue I happened upon: on the original reels containing the 35mm film, there is the FMA FileSearch logo, containing an image of the same barcode. I'm thinking that this 8x8 2-dimensional barcode/grid captures the name of the company (either FMA or FileSearch, or both?) and therefore can be used as a reference to decode how the barcode-ey thing contains its information.

But, alas, I'm lacking in the quantitative and old-time-computing experience to determine *how* exactly that company logo barcode says anything.

Am I wrong to assume it encodes some information in binary form? How much information could be packed into an 8x8 2-dimensional grid like this? I don't think there's a need for a "directional" reference in the grid, since the whole thing was designed to sit on a film strip with a known directional orientation. How can I go about decoding these index codes?

[For more on the history of the CDEC unit, and the document collection, here's a link to the Archives website (search on CDEC). Here's the only info I could find on the FMA FileSearch gizmo (p. 536 of this pdf). Haven't located any Patents that might correlate with this code yet either.]
posted by garfy3 to Technology (29 answers total) 15 users marked this as a favorite
 
Here's what you do. Research records that will get you the name of the company and details about it. Check DoD contract records for the contract in question. Then check the Securities and Exchange Commission's filings for the company for the names of persons who worked for the company, assuming it is not privately held. Try to get a name and then use a commercial search service to find out the name of someone still alive associated with the company and see what they would know about technical specs. The patents are usually too general to be useful. You need a person who knows any details about the system or where documents are.
posted by Ironmouth at 9:19 AM on September 22, 2007


An 8 by 8 binary grid could hold, in theory, 2^(8*8) or 2^64 (a really freaking huge number) unique combinations.

The thing is, it's highly unlikely the barcodes actually hold information. Barcodes typically represent an arbitrary serial number that can be used to look up and store associated data in a database.

Then again, it's possible they encode some intrinsic information, but it would have to be very short, or use a very limited character set (it's 8 bytes so maybe 8 ASCII characters? That assumes no bits are used for checksumming).
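A quick sanity check of that arithmetic, as a minimal Python sketch. It assumes every one of the 64 cells carries a data bit (no timing marks, no check bits), which is only a guess about the real layout.

```python
# Capacity of an 8x8 grid if every cell is one data bit (an assumption).
ROWS, COLS = 8, 8
total_bits = ROWS * COLS
combinations = 2 ** total_bits
ascii_chars = total_bits // 8  # 8 bits per character, nothing reserved for a checksum

print(total_bits)    # 64
print(combinations)  # 18446744073709551616
print(ascii_chars)   # 8
```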

For a modern example of this sort of physical data storage with a 2D barcode see Datamatrix
posted by phrontist at 9:25 AM on September 22, 2007


Response by poster: phrontist I was afraid of that, that the information embedded in the code blocks just refers to a numbered list of "keywords" that existed only on paper. I've found nothing close to such a listing thus far, which does not bode well. I guess I'm just hoping that the symbol in the company logo offers a glimmer of a clue...
posted by garfy3 at 9:34 AM on September 22, 2007


Looking closer at the flickr photo pool, I'd be willing to bet they are just a sequential numbering scheme, as you guess. Would that be useful at all? It seems like they probably did (or at least intended to do) annotations that could be looked up by these serial numbers. Perhaps they still exist? Otherwise you'll just end up with numbers, I think.

To figure out the binary scheme you just need to look at a few of them, and if they are sequential, the bits will change in a regular pattern. The lowest order bit will transition every single time. The next one up will transition every other time. The third order bit will transition every fourth time, and so forth, up the powers of two. If some bits don't seem to follow a regular state change pattern, they are probably checksum bits, calculated using the state of all the others.
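The transition test described above can be sketched in a few lines of Python. This is only an illustration: `transition_counts` is a made-up helper name, and the input is assumed to be the codes from consecutive frames already transcribed into integers.

```python
def transition_counts(codes, width):
    """Count how often each bit position changes between consecutive codes.

    codes: list of ints read off consecutive frames (hypothetical input).
    width: number of bit positions to inspect.
    Returns a list where index 0 is the lowest-order bit.
    """
    counts = [0] * width
    for prev, cur in zip(codes, codes[1:]):
        changed = prev ^ cur  # bits that differ between neighbouring codes
        for bit in range(width):
            if (changed >> bit) & 1:
                counts[bit] += 1
    return counts

# A plain counter 0..7: the low bit flips every step, the next one every
# second step, the third every fourth step.
example = transition_counts(list(range(8)), width=3)
print(example)  # [7, 3, 1]
```

If the real codes are sequential, sorting the bit positions by their counts should recover the bit order; positions whose counts don't roughly halve from one to the next would be candidates for check bits.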
posted by phrontist at 9:38 AM on September 22, 2007


If you post a few sequential ones I'd be willing to figure out the scheme for you. Email me if and when you do.
posted by phrontist at 9:40 AM on September 22, 2007


Response by poster: well, holy google, i just found the patent for the machine. it references an earlier patent application for the coding scheme, which doesn't seem to be available in the google patent world. hrrmmm..

Patent
posted by garfy3 at 9:48 AM on September 22, 2007


Back in the 70s, the public library where I lived used a 2d (magnetic I think) barcode pasted into the backs of books. It was about 8x8 (ie could instead have been 6x10 - my memory is vague).

It might be the case that the FMA machine was only custom gear because it needed to work with film, and was using a standard archiving code system.

The public library system I remember was a paper sheet about 7x12cm glued into the inside back of the book, with a grid of holes about 4mm diameter, through which an underlayer of foil could be seen. When I say "grid of holes", I mean the barcode - half the "holes" were not there, but those holes that were there lined up on the grid. The machine did not need contact with the face of the sheet to "read" the book's identification.

I second that the code most likely only contains serial numbers for the following reasons:
- For a barcode in that era, 8 bits per character seems overly wasteful. 6 bits gives 64 characters (ie 26 letters, 10 numbers, some misc stuff), but doesn't divide evenly into the 8x8 grid - there are 4 bits left over. Maybe 7 bits (thus giving space for uppercase and lowercase) and a check bit?
- "FMA" would presumably only take up 3 of the 8 (or more) characters the code can contain, so you would expect to see a repeating bit pattern for most of the logo indicating the remaining characters are all the same (ie blank).
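The packing arithmetic in the first point is easy to check. A minimal Python sketch, again assuming all 64 cells are usable for data:

```python
GRID_BITS = 8 * 8  # assumes every cell of the grid is a data bit

def packing(bits_per_char):
    """Return (whole characters that fit, bits left over)."""
    return divmod(GRID_BITS, bits_per_char)

for bits_per_char in (6, 7, 8):
    chars, leftover = packing(bits_per_char)
    print(bits_per_char, chars, leftover)
# prints:
# 6 10 4
# 7 9 1
# 8 8 0
```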

However, this all goes out the window if the machine was using some kind of compression algorithm, but that's way out of my league if it was. If compression is involved, then (full circle) it might be easier to look for similar archiving codes in the hope that they're using a standard code.
posted by -harlequin- at 10:04 AM on September 22, 2007


harlequin: Compression does not make sense in this case. There is a certain overhead (space consumption wise) to any compression algorithm - a minimum length under which compression hurts more than it helps. 64 bits is very, very tiny - I highly doubt it would be employed here.
posted by phrontist at 10:53 AM on September 22, 2007


If it's true (as the first flickr caption suggests) that the machine could search a reel by keyword, then that makes me think that the barcode does encode keywords rather than just sequential serial numbers. If it were serial numbers, there would have to be a second database listing the keywords for each serial-- which isn't impossible, and is the way I'd do it today-- but at least there's a chance. However, the keywords themselves would have serial numbers rather than being encoded text. Illustrating with a random subject area, the keyword codes might be
1: left hand
2: right hand
3: left foot
4: right foot
etc.
and a frame of film would be coded 3 and 4 if it pertained to both feet.

The most obvious system for an 8x8 grid would be a set of 256 possible keywords, and up to 8 keywords applied to each frame. If one were to go through a few hundred frames and come up with potential keywords for them while converting the barcodes into sets of numbers, and then compare the keyword list to the sets of numbers, it might be possible to find a correlation.
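To make the "256 keywords, up to 8 per frame" idea concrete, here is a rough Python sketch. Everything about the layout is an assumption for illustration: one keyword per row, leftmost cell as the high bit, and an all-zero row meaning an unused slot.

```python
def rows_to_keyword_ids(grid):
    """Read each row of an 8x8 grid of 0/1 cells as one 8-bit keyword number."""
    ids = []
    for row in grid:
        value = 0
        for cell in row:
            value = (value << 1) | cell  # leftmost cell is the high bit (assumed)
        ids.append(value)
    return ids

# Made-up frame tagged with keywords 3 and 4 ("left foot" and "right foot" in
# the example above); the six all-zero rows stand for unused slots (assumed).
frame = [[0, 0, 0, 0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0, 1, 0, 0]] + [[0] * 8 for _ in range(6)]
print(rows_to_keyword_ids(frame))  # [3, 4, 0, 0, 0, 0, 0, 0]
```

Running something like this over a few hundred frames would give the sets of numbers to compare against guessed keywords.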

It's also possible that the barcode has both a serial number and a smaller set of keyword numbers.

This is a really interesting problem, and I hope you'll keep us updated about your progress!
posted by moonmilk at 11:09 AM on September 22, 2007


Response by poster: From the patent I linked above, it appears that setting the 2D barcode initially, as well as "searching" for a match for it later, was done using a punchcard that the machine would read. So, to amplify phrontist's concerns, the punchcards used to initially encode keywords may have reflected some numerical list.

One interesting sidenote--an Archivist once told me that the subject classifications used in the index were probably derived from a WWII set of classifications called the Intelligence Subject Codes (ISCs or something). Perhaps moonmilk is onto something with tracking the coded numbers and the obvious subjects of the documents. Well, with the added difficulty that the documents are entirely *in Vietnamese*. Gah.

Perhaps I can put my hands on that original patent application for the coding scheme, will have to make some calls over to the USPTO...
posted by garfy3 at 11:23 AM on September 22, 2007


CDEC actually = "Combined Documents Exploitation Center." Searching on this and combining it with terms such as 'microfilm' pulls up some interesting results here and here. It looks like others are trying to crack this as well.

Anyway I'm guessing (like others) that the barcodes are serial numbers that refer to indexes elsewhere. E.g. you would look up "Ho Chi Minh" and get a list of serial numbers - e.g. off the top of my head 123ABC-ED1, 234HGF-TY6 - and you would then punch these into the Filesearch machine, which would then set up some photo-electric switches to scan through the film as it was being wound through, stopping at the appropriate document.

Interesting that the index codes may have been around since WW2; the technology certainly has.
posted by carter at 11:41 AM on September 22, 2007


Here's a link to a page that links to a lot of pdfs of what I assume may be examples of some of the original indexes. They all appear to be of the form nn-nnn(n)-nn, which seems to refer to '2 digit month of translation'-number of document-'2 digit year of translation.'

E.g. the first document listed on this page, which has the index # 09-1935-66, was translated/indexed in September 1966.
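If that reading of the format is right, splitting an index number into its parts is straightforward. A small Python sketch; the field names and the assumption that two-digit years belong to the 1900s are mine, not something stated on the page.

```python
import re

# nn-nnn(n)-nn: two-digit month, three- or four-digit document number, two-digit year
INDEX_RE = re.compile(r"^(\d{2})-(\d{3,4})-(\d{2})$")

def parse_index(index_number):
    """Split an index number such as '09-1935-66' into month, document, year."""
    match = INDEX_RE.match(index_number)
    if match is None:
        raise ValueError(f"not in nn-nnn(n)-nn form: {index_number!r}")
    month, document, year = (int(group) for group in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return {"month": month, "document": document, "year": 1900 + year}  # 19xx assumed

print(parse_index("09-1935-66"))  # {'month': 9, 'document': 1935, 'year': 1966}
```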
posted by carter at 12:02 PM on September 22, 2007


The basic structure of the barcode seems pretty straightforward, just guessing— the strip of slightly narrower dots at the top is the clock/index signal, and each word of data extends perpendicular to that strip. The long horizontal line is probably also some sort of marker rather than data. So each word would consist of two 7-bit fields, or maybe two 8-bit fields if the eighth bit happens to be the same for all the values in that photo. Hand-decoding that photo just produces garble in ASCII or EBCDIC ... I like moonmilk's theory, that the barcodes list values from some limited vocabulary of keywords. Perhaps the operator would look up the relevant numbers in a book kept by the reader.

Some ambiguity is which end of the word is the more-significant bit, and whether 0 is represented by a dark square or a light square.
posted by hattifattener at 1:08 PM on September 22, 2007


Ah, the patent you link to describes the layout, starting at line 58 of column 4. The data is in fact two fields of 7 bits each, plus timing and registration marks of various sorts.

Patent 3342978 looks like it's the corresponding patent for the reader device.
posted by hattifattener at 1:52 PM on September 22, 2007


This is a lovely puzzle.

I really tried to make sense of it. But I couldn't. Anyways, here is a text version of the binary, in case someone smarter than me wants to give it a shot.

Assuming these are a bunch of 7 bit fields, there are four ways to interpret the binary:
1) White represents 1, most significant bit is at the top (in the image)
2) White represents 0, most significant bit is at the top
3) White represents 1, most significant bit is at the bottom
4) White represents 0, most significant bit is at the bottom

Here are these variations, and their corresponding decimal values after the slash:
Case 1:

Upper
1010100/84
0000001/1
0001101/13
0000001/1
1011101/93
0001110/14
0000001/1
0001110/14
1011101/93

Lower
1100100/100
0000001/1
0000001/1
0000001/1
1101000/104
0001011/11
0000001/1
0000010/2
1110110/118

Case 2:

Upper
0101011/43
1111110/126
1110010/114
1111110/126
0100010/34
1110001/113
1111110/126
1110001/113
0100010/34

Lower
0011011/27
1111110/126
1111110/126
1111110/126
0010111/23
1110100/116
1111110/126
1111101/125
0001001/9

Case 3:

Upper
0010101/21
1000000/64
1011000/88
1000000/64
1011101/93
0111000/56
1000000/64
0111000/56
1011101/93

Lower
0010011/19
1000000/64
1000000/64
1000000/64
0001011/11
1101000/104
1000000/64
0100000/32
0110111/55

Case 4:

Upper
1101010/106
0111111/63
0100111/39
0111111/63
0100010/34
1000111/71
0111111/63
1000111/71
0100010/34

Lower
1101100/108
0111111/63
0111111/63
0111111/63
1110100/116
0010111/23
0111111/63
1011111/95
1001000/72
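The four readings listed above can be reproduced mechanically from the Case 1 transcription, which saves redoing the conversion by hand for new scans. A minimal Python sketch (`four_readings` is a made-up helper name):

```python
def four_readings(case1_bits):
    """Given one field written as in Case 1 (white = 1, most significant bit
    at the top), return its decimal value under each of the four cases."""
    inverted = "".join("1" if b == "0" else "0" for b in case1_bits)
    return {
        1: int(case1_bits, 2),        # white = 1, MSB at top
        2: int(inverted, 2),          # white = 0, MSB at top
        3: int(case1_bits[::-1], 2),  # white = 1, MSB at bottom
        4: int(inverted[::-1], 2),    # white = 0, MSB at bottom
    }

# First "Upper" field from the listing above.
print(four_readings("1010100"))  # {1: 84, 2: 43, 3: 21, 4: 106}
```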


I can't make anything out of this. My head hurts and I'm going to bed.
posted by cheerleaders_to_your_funeral at 3:26 PM on September 22, 2007


This may or may not be of any help at all to you, but here goes: I once worked with a very primitive sorting system that used punched-cards for Boolean keyword sorting. The mechanics of this were very low-tech, but I wonder if it's similar in theory to how the 'barcodes' work.

The cards I used had holes punched near the edge, and the area between the hole and the edge of the card was perforated so that if you put something (a pencil, whatever) into the hole and tugged upwards, the perforation would tear out, leaving a notch in the card.

You assigned each notch on the card to some category, and if you wanted to assign a card to a particular category (or multiple categories), you tore out the perforations on all the holes EXCEPT the ones you wanted to use. In this way, if you had a stack of cards and you wanted to select one particular category, you could take a bit of wire, stick it through the hole for that category, and gently lift out only the cards assigned to that category. (And if you used two wires you effectively had an AND search, or if you used the cards that fell out when you lifted, you did a NOT search.)
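In modern terms the wire trick is just set membership. A small Python sketch of the same AND / NOT searches; the card names and categories are invented purely for illustration.

```python
# Each card lists the categories whose holes were left intact (made-up data).
cards = {
    "card A": {"maps", "supply"},
    "card B": {"maps"},
    "card C": {"medical"},
}

def lifted(cards, *categories):
    """Cards that come up on the wire(s): assigned to every listed category."""
    wanted = set(categories)
    return sorted(name for name, cats in cards.items() if wanted <= cats)

def dropped(cards, category):
    """Cards that fall off the wire: not assigned to the category (a NOT search)."""
    return sorted(name for name, cats in cards.items() if category not in cats)

print(lifted(cards, "maps"))            # ['card A', 'card B']
print(lifted(cards, "maps", "supply"))  # ['card A']  (two wires = AND)
print(dropped(cards, "maps"))           # ['card C']
```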

I tend to wonder if the bar code isn't an electromechanical extension of the same basic sorting theory. With an 8x8 grid you have a very large number of possible 'categories' or 'keywords.' (Using not just one space on the grid per category, but also using certain combinations within the grid for categories.) Presumably in order to search, you had to have access to a key, some list that contained all the categories/keywords, and that let you set the machine up with the ones you wanted it to search for.

Since you don't have the key, the barcodes would just look arbitrary. They wouldn't look like sequential numbers, but just random. Maybe, the key to unraveling them is to start looking at documents that are similar in certain ways; ones that relate to certain events, anything that makes sense for someone to have tagged similarly. Then, look at the barcodes for those documents and see if there are any patterns that keep turning up.

I'm not sure how you could automate this process. If you can get good scans of all, or a large number of, the barcodes, you could find some way of turning them into numerical data and performing statistical analysis and cryptanalysis on them. (There is quite a community of amateur cryptographers out there, Elonka Dunin arguably chief among them; she might be able to put you in touch with people who would welcome a challenge).
posted by Kadin2048 at 11:01 PM on September 22, 2007


Well, you piqued my curiosity, so I did a little digging.

Running a search for "FileSearch" in the USPTO database turned up a hit, which gave me the company's name and address:

FMA, INC.
142 NEVADA ST. EL SEGUNDO CALIFORNIA

Searching that address is a dead end; it's now an Alfa Romeo repair shop.

However, searching "FMA El Segundo" turns up some interesting things. Apparently the company didn't die after the FileSearch, they went on to make a machine called the 'Rapid Selector' and sold it to the Federal government; it got used by a few agencies.

If you search for various combinations of FMA and 'Rapid Selector', a little information turns up. One of the more interesting is this archive of Bruce Sterling's "Dead Media Project" mailing list (search for 'Rapid Selector'). Apparently it's mentioned in a book, Information and Secrecy: Vannevar Bush, Ultra, and the Other Memex by Colin Burke, published in 1994. That might be something to look at if all else fails, just because it might give names of people who were in the company at the time.

The other thing of note that I found was an actual description of the FileSearch machine, as given in a blurb in a Nov. 1961 library trade magazine (warning huge PDF). Here's the relevant part, from pp. 50-51:
The system is composed of a recording unit and a retrieval unit. The recording unit is a Flexowriter typewriter, 35mm planetary camera with code recording unit, recording table, lights and controls. The retrieval unit is a film transport, automatic code reader, request card reader, output viewer and hard copy printing unit. All are combined in a desk-size assembly. The recording unit photographs files of documents along with their coded descriptions in the form of opaque spots. These are stored on microfilm. Filesearch searches documents at the rate of 6,400 pages a minute, documents are displayed on a viewing screen and hard copies are automatically printed if desired.
So the 'barcodes' are definitely categorization data, not serial numbers.

The other page that turned up as mentioning the FileSearch machine was this one (warning huge, horribly-formatted XML page that will take Firefox ages to render). It turns out that it's the contents of a special collection at the Charles Babbage Institute at the University of Minnesota; somewhere in their collection they have two publications about the FileSearch: The FMA File Search System and The FMA FileSearch System. An Integrated Machine Solution to Information Storage and Retrieval, both published by the FMA Corporation. That might be worth following up; unfortunately neither is online.

I think honestly that the barcodes may be of limited usefulness. They're remnants of a categorization system that's 40+ years old now, and by all accounts wasn't that spectacular when it was new. At best, if you did manage to translate and import them to a computer, they'd just give you a bunch of keyworded descriptions for each document. You'd probably be better spending your time trying to get the documents themselves scanned, and then letting a modern system (like Google's) process the content so you can do full-text searches, rather than relying on the very limited number of keywords that some clerk would have assigned to each document when it was archived.

I'd imagine that there's lots of equipment around for digitizing 35mm movie film at fairly high resolutions; perhaps that could be put to use to scan the documents into a modern digital archive for further analysis.
posted by Kadin2048 at 12:41 AM on September 23, 2007


You might find further experts at BarCode1, whose sponsors would probably love to hear from you. There's also some good information on that page about how 2d barcodes work, in terms of finder bars and check blocks and stuff.
posted by Myself at 4:55 AM on September 23, 2007


Response by poster: wow guys, thanks for the analysis and useful tips. i'm definitely leaning toward finding a human that may have been involved with the original company, hopefully to find the one key piece of information: whether the FileSearch system used a numbered list of keywords, or somehow represented the keywords more directly in the 'barcodes' themselves.

if that doesn't pan out, i've also got an inter-library loan request in for a publication in a journal called "Systems" from 1965 that discusses the FMA FileSearch system in comparison to 4 rivals of the day. there might be more detail in that article.

i also plan to look in on that collection of documents at the Charles Babbage Institute...could be an actual user manual there!

cheer_leaders_to_your_funeral: was that analysis performed on the 8x8 grid shown in the company's logo? or was that from the scanned document? thanks!
posted by garfy3 at 12:59 PM on September 23, 2007


That was from the scanned document, not the logo.
posted by cheerleaders_to_your_funeral at 3:48 PM on September 23, 2007


Response by poster: ah, thanks.

On a different note, i'm wondering if there may be something to the "case 3, upper" set that you drew out above--that's the only set that contains a series of only 2 digit numbers. Not that that really connects to an alphabet scheme or anything...

Or perhaps the set that contains the widest spread of numbers is the preferable set, as that would correspond to the largest set of itemized keywords (if that's how it were done)?
posted by garfy3 at 4:18 PM on September 23, 2007


Well, what I was hoping for when getting the decimal numbers was that something obvious would pop out. Like dates, or numbers corresponding to letters in some character set. I can't see any dates, and it isn't ASCII.

It could be any arbitrary character set, but I don't think it is. It would make sense to group letters - ASCII has A-Z at codes 65-90 for example. These numbers are spread too wide to fit within a 26-letter alphabet.
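The "spread too wide" point can be checked directly against the numbers posted earlier in the thread. A minimal Python sketch; it only tests the assumption that the letters would sit in one contiguous block of codes, as A-Z do in ASCII.

```python
# "Upper" fields from the Case 1 and Case 3 listings earlier in the thread.
case1_upper = [84, 1, 13, 1, 93, 14, 1, 14, 93]
case3_upper = [21, 64, 88, 64, 93, 56, 64, 56, 93]

def fits_alphabet(values, alphabet_size=26):
    """True if all values fall inside some contiguous block of alphabet_size codes."""
    return max(values) - min(values) < alphabet_size

print(max(case1_upper) - min(case1_upper), fits_alphabet(case1_upper))  # 92 False
print(max(case3_upper) - min(case3_upper), fits_alphabet(case3_upper))  # 72 False
```

A character set with letters scattered non-contiguously would not be ruled out by this test.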

It was worth a shot, but I'm starting to lean towards thinking it's unlikely to get anything meaningful out of purely the barcode.

The most important reason for that is this: If I were to design this system today I would store a serial number on the film, and have this correspond to keywords etc in some database. (Of course, I wouldn't store things on film either, so things were different back then)
posted by cheerleaders_to_your_funeral at 3:40 AM on September 24, 2007


Response by poster: I'm scanning a couple more examples from the microfilm transfer next week. However, in correspondence with someone who once wrote an article on CDEC and the indexing arrangement, I can now confirm that the binary barcodes are in fact coding numbers, specifically numbers from the Defense Intelligence Agency's wartime Subject Classification codes. These are apparently reproduced at the National Archives, similar to the Decimal Filing System in use by the Army in WWII (in which "201" means Personnel File, and "293" means Casualty File, and "314.7" means Historical Report, etc).

Once I have this in hand, my plan is to take the codes shown in the flickr photos, which document a subset of Narcotics investigations, and run cheerleaders_to_your_funeral's binary decode on each line in the 8x8 grid, for both variants (white=1 / black=0 and white=0 / black=1). I'll then compare the resulting numbers against the subject classification numbers and see which method arrives at a classification that would fit "narcotics."

I've also got to give a shout out to a librarian at the Special Collections dept of the Library of the University of Minnesota. They're sending me a copy of the user's manual for the FileSearch machine that was among their collection of the Charles Babbage Institute.

At the end of this all, I may post a link to a paper describing the whole code-scheme, and more example images. Thanks for all the good info and helpful analysis people!!

one side-note, I found that over the weekend the google-patent search for "FMA, inc" only turned up one hit. but this morning there was a second, and it describes the actual reading of the barcode (direction, timing marks, etc). somebody.
posted by garfy3 at 2:49 PM on September 24, 2007


Response by poster: ...may have been reading along, ahem
posted by garfy3 at 2:50 PM on September 24, 2007


Response by poster: Alright, here's the first image of a larger set of codes, for two documents that appear in sequence.

I can tell now from a look at Patent No. 3342978 that the scanner reads from left to right, starting at the top lefthand corner, and zig-zagging down the page. There's a sequential number buried in here somewhere, folks...I think?
posted by garfy3 at 8:35 AM on September 26, 2007


At the end of this all, I may post a link to a paper describing the whole code-scheme, and more example images.

That would be very cool, I'm very interested to hear how this project pans out.
posted by Kadin2048 at 9:16 PM on October 19, 2007


Response by poster: ****UPDATE: Lots to report.

I located a manual for the defunct machine in the library of a military history office in Pennsylvania, of all places. According to WorldCat, it's the only copy available to the public in the known librariverse.

Here's the solution to the code: flickr link.

Turns out the code at first looks similar to a standard code IBM used in its punchcards, but on further digging, there are some big differences. Here's a wikipedia link showing a punched-tape with remarkable similarities.

Finally, here's a poorly reproduced photo of the machine itself, and an example of 35mm film capturing encoded index terms: flickr link.

In case you're interested, I'm now in the process of locating machine-vision software that might churn through this vast set of microfilm and de-code the original index. Came across something called "Sherlock" (can't find the right link) that looked promising. Any leads in that direction most welcome! And thanks for the help so far!

[I may also make an appearance at the annual history-of-technology meeting with a paper on this beast of a project. /blatant self-promotion]
posted by garfy3 at 12:14 PM on November 28, 2007


(I sent you email — if you can't find a piece of existing software to scan the barcodes, I'd be interested in trying to write some myself, just for the heck of it.)
posted by hattifattener at 5:35 PM on November 29, 2007


Response by poster: Not that anyone is still reading this thread, but I'll be giving a talk next week at the Vietnam Center's Annual Symposium on the successful decoding of this bit of history at Texas Tech University. Not sure what to expect of Lubbock, Texas, but if any mefites are in the area, drop me a note, eh?

Cheers
posted by garfy3 at 4:03 PM on March 6, 2008

