From 70-year-old print to digital media.
August 18, 2007 7:11 PM

Building a searchable database of back-issue LIFE magazine contents (1936-1972): articles, features, subjects, stories, etc. What tools should I use, and how should I proceed?

What's the quickest way to find every article about or picture of John Nance Garner that was ever printed in LIFE magazine? Or about Truk atoll? Or left-handed people? The project is to either locate or create a database that can be queried to find every issue of LIFE that had a story or picture on a particular subject. Something like a Readers' Guide to Periodical Literature that covers all 1836 issues of LIFE published between October 1936 and December 1972. Tools currently available include print copies of the contents indexes published by LIFE every six months. The listing, descriptive style and format of these indexes change greatly over the years. I also have a scanner with OCR software, the MS Office Pro suite, Filemaker Pro and Dragon Naturally Speaking Preferred.
posted by X4ster to Technology (8 answers total)
 
I know that you're looking for more than this, but you have seen the index of sorts of Life Magazine that is online at pastpaper.com, right?

If you were to actually create an index and could convince funders that such a project was doable by you, that's likely something that you could get a grant for. One of my local libraries just got a grant from a state association to work on the indexing of their cemetery records and back issues [to the 1800's] of old newspaper microfilms. I'll leave it to the more expert librarians as to how you might go about this, but I will say that if you don't have the blessings of the copyright holders, whatever index you make, if it includes full text and/or images, is likely to be very problematic from a legal standpoint.
posted by jessamyn at 7:19 PM on August 18, 2007 [1 favorite]


Response by poster: Thanks jessamyn, I'm familiar with pastpaper and other similar sites. They can help me if I know the issue date of a particular feature, but can't help find issues that have stories on a specific subject. The end product I want would be as succinct as possible: no images, and with text of only as many words as necessary to define a record.
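
A record that lean could look something like the sketch below. This is only an illustration in Python: the field names (subject, issue_date, page, note), the find_subject helper and the sample entries are all made up for the example, not taken from the actual LIFE indexes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexRecord:
    """One index line: a subject and where it appears (hypothetical fields)."""
    subject: str      # heading as printed in the index
    issue_date: date  # cover date of the issue
    page: int         # page on which the item starts
    note: str = ""    # optional few-word description


def find_subject(records, term):
    """Return records whose subject or note contains term, oldest first."""
    needle = term.casefold()
    hits = [r for r in records
            if needle in r.subject.casefold() or needle in r.note.casefold()]
    return sorted(hits, key=lambda r: (r.issue_date, r.page))


# Placeholder entries for illustration only, not real index data.
sample_records = [
    IndexRecord("Left-handedness", date(1950, 1, 2), 10, "placeholder"),
    IndexRecord("Garner, John Nance", date(1940, 1, 1), 20),
    IndexRecord("Garner, John Nance", date(1938, 1, 3), 30, "placeholder"),
]
```

Calling find_subject(sample_records, "garner") would return the two Garner placeholders in date order, which is the kind of "which issues?" answer the project is after.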
posted by X4ster at 7:27 PM on August 18, 2007


One way to build this might be to set up a wiki and ask people to contribute articles that they might have on hand and would be willing to upload to the site.

If you had funding you might be able to acquire the list of people who subscribed to the magazine through the years. From there you might be able to contact readers who would be interested in supplying content to the wiki.
posted by bkeene12 at 7:49 PM on August 18, 2007


Amazon Mechanical Turk might be useful with tagging and categorizing things cheaply.
posted by lunchbox at 7:55 PM on August 18, 2007


Response by poster: bkeene, I've got the indexes as well as all the issues of the magazine. I don't need the articles, just an expeditious way to search for which issues contain specific subjects. Because subject matter indexes were printed every six months, there is no master contents index.
posted by X4ster at 7:59 PM on August 18, 2007


This might be a bit of a naive (please forgive the absence of the accent) approach, but have you tried this?
Please note that I am assuming you are using a Windows PC for this job - I don't know anything about Macs. I am also assuming that the indices are in a similar format to a 'back of a book' index.
Save regularly.

1) Scan the indices using the OCR scanner into a text document.
2) I have no idea how big your scan will be (I've never read LIFE magazine), or how your OCR software works. If you end up with lots of little text files, use a program like TextPad to make one big text file.
If the format of the indices ever changes, just join those text files that share the same format.
3) Open the document in something like Excel.
(In the 'Open' dialog, set it to show all files and click on your scan.)
Use the wizard that pops up to say that the columns are fixed width rather than separated by a special character, click Next, and then set the column breaks manually. Don't worry about undesired junk characters or the like - we'll clear those up later.
When that is done, click Finish and see what you have - if it's not properly separated into columns, try again.
4) Remove the junk characters. If there are a lot of 'periods' in the text, use 'search & replace' to swap them with nothing (Ctrl + H, put the offending character in the 'Find what' box and nothing in the 'Replace with' box). You might want to put double periods in the box, so you're less likely to remove periods that should be there. Save regularly.
5) You can remove unwanted spaces in front of the text in cells with the TRIM function (Excel will talk you through it; respond to this post if you want more info).
6) There will be a lot of header lines and the like that you don't want. You need to look for a way for Excel to identify these so that you can delete them. I'd suggest looking for a column that is blank except where a header line is, pressing Ctrl + A to select everything and then using Data -> Filter -> AutoFilter to create the auto filter triangles on the top row. Use that to select 'non-blank' cells in that column. When they are all selected (and the meaty stuff you want to keep isn't), select all the _rows_, right click and select delete rows. Remove the filter (click on the triangle, select show all). Save regularly.
7) Shuffle things around so that all the spreadsheets share the same column format.
8) The tedious bit - copy and paste each individual spreadsheet into one document. Alternate method: save the documents as text files again, and have TextPad stitch them all together into one document. Then reopen that document in Excel as stated before - it'll be a lot easier this time.
9) Tidy up the document - add a header, sanity check the data (as you have been doing at each stage throughout), put that filter back in, freeze panes on the top row, etc. Save regularly, and remember that there is an Undo and a Redo command.
10) You should be able to use the filter command to show just the rows you're interested in. You have other options - pivot tables might be easier to use, and give you more versatility. If you want complex searches, select everything and paste it into Access. It would be an EXTREMELY ugly database (one table - shudder), but you can do some pretty fancy stuff in Access with queries.
If you do use Access, consider doing the really annoying job of breaking up the database into a few linked tables - it will save you from the all-too-easy scenario of searching for something that has been spelt incorrectly, or mis-translated by your OCR software.
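
If you'd rather script the cleanup than click through it, the same idea (strip junk characters and runs of periods, trim spaces, drop header lines, write one table) might look roughly like this in Python. It's a sketch under assumptions: I'm guessing that an index line reads "subject, run of periods, reference", that headers are all-caps lines containing a word like LIFE or INDEX, and the junk-character set is my own pick. Adjust all of that to what your scans really look like.

```python
import csv
import io
import re

# Assumed layout of an OCR'd index line: "Subject ..... reference".
DOT_LEADER = re.compile(r"\s*\.(?:\s*\.)+\s*")  # two or more periods, possibly spaced
JUNK = re.compile(r"[|~^_`]+")                  # stray marks OCR may produce (assumed set)


def clean_line(line):
    """Drop junk characters, collapse spaces, and split at the run of periods."""
    line = " ".join(JUNK.sub(" ", line).split())
    parts = DOT_LEADER.split(line, maxsplit=1)
    return [p.strip() for p in parts]


def is_header(line, header_words=("LIFE", "INDEX", "VOLUME")):
    """Guess that a line is a page header: blank, or all caps with a header word."""
    text = line.strip()
    if not text:
        return True
    return text.isupper() and any(w in text for w in header_words)


def index_text_to_rows(text):
    """Turn raw OCR text into [subject, reference] rows, skipping headers."""
    rows = []
    for line in text.splitlines():
        if is_header(line):
            continue
        parts = clean_line(line)
        # Keep only lines that split cleanly into two non-empty parts.
        if len(parts) == 2 and all(parts):
            rows.append(parts)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet or database can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "reference"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Lines that don't split cleanly are silently skipped here; in practice you'd probably want to collect them in a separate list and fix them by hand.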
posted by YAMWAK at 2:07 AM on August 19, 2007


Response by poster: YAMWAK,
Thanks for taking the time to write your detailed response. Yours is the same process that I've been following, with minor differences. I was hoping that someone might be able to give me ideas for a quicker, simpler solution, but there may be no easier way.

LIFE magazine and the indexes that were published for it are tabloid sized, 11 x 14 inches, so to fit my scanner they have to either be cut into pieces or moved around on the bed to capture the whole page. LIFE was published as a weekly for 36 years and for 30 of those years there were two indexes per year, giving me a total of 66 indexes of around 20 pages each. That's a lot of printed text to deal with. There are tens of thousands of individual bits of data in the magazines.

I buy and sell magazines and have customers who are seeking specific, unique things and are willing to pay well for content on their interests. I'd like a quick way to locate their wants, and a database seems the logical answer. As you point out, what I'm ending up with is a single monster flat data table. I have both Access and Filemaker Pro; I prefer Filemaker for its ease of use. The text scans go most easily to Excel, where I use both the replace command and the speech recognition program to edit. From there it's easy to import to the database, but it's exceedingly slow and time consuming.
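
For what it's worth, here is a rough sketch of how a flat table could be split into a few linked tables and then queried by subject. It uses Python's built-in sqlite3 module rather than Access or Filemaker, and the table names, column names and helper functions are invented for the example; the same layout should carry over to either program.

```python
import difflib
import sqlite3

SCHEMA = """
CREATE TABLE subjects (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE issues (
    id         INTEGER PRIMARY KEY,
    issue_date TEXT NOT NULL UNIQUE   -- ISO date text, so it sorts correctly
);
CREATE TABLE mentions (
    subject_id INTEGER NOT NULL REFERENCES subjects(id),
    issue_id   INTEGER NOT NULL REFERENCES issues(id),
    page       INTEGER
);
"""


def build_database(flat_rows):
    """Load (subject, issue_date, page) rows from the flat table into linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for subject, issue_date, page in flat_rows:
        # Each subject and each issue is stored once; mentions link the two.
        conn.execute("INSERT OR IGNORE INTO subjects(name) VALUES (?)", (subject,))
        conn.execute("INSERT OR IGNORE INTO issues(issue_date) VALUES (?)", (issue_date,))
        conn.execute(
            """INSERT INTO mentions(subject_id, issue_id, page)
               SELECT s.id, i.id, ? FROM subjects s, issues i
               WHERE s.name = ? AND i.issue_date = ?""",
            (page, subject, issue_date),
        )
    conn.commit()
    return conn


def issues_for(conn, subject):
    """Return (issue_date, page) pairs for a subject, oldest first."""
    cursor = conn.execute(
        """SELECT i.issue_date, m.page
           FROM mentions m
           JOIN subjects s ON s.id = m.subject_id
           JOIN issues i ON i.id = m.issue_id
           WHERE s.name = ?
           ORDER BY i.issue_date, m.page""",
        (subject,),
    )
    return cursor.fetchall()


def suggest_subjects(conn, typed, limit=3):
    """Offer close spellings, to catch typos or OCR slips in subject names."""
    names = [row[0] for row in conn.execute("SELECT name FROM subjects")]
    return difflib.get_close_matches(typed, names, n=limit, cutoff=0.6)
```

Keeping each subject name in one place means a misread heading only has to be corrected once, and suggest_subjects gives a fallback when a search term doesn't match exactly.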
posted by X4ster at 9:52 AM on August 19, 2007


As someone who works on digital library projects for a living, I can tell you that this is a huge project, and I hope that your customers will pay you very, very well for the functionality that you are providing.

That said, it sounds as if you are already doing it. Have fun. If you want to be able to search for things with relevance ranking, which will probably help you a lot, I have a few suggestions:
1. Save each article and/or issue (depending on your need) as a separate text file.
2. Use an easy-to-use text indexer like SWISH-E to index and help you search your digitized contents.
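
To give a feel for what relevance ranking means, here is a toy sketch in Python. It is not how SWISH-E works internally; it just scores each text file by a simple TF-IDF weight (how often a query word appears in the file, boosted for words that are rare across files). The function names and the file-name-to-text dictionary are assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Count words per document. `documents` maps a file name to its text."""
    return {name: Counter(tokenize(text)) for name, text in documents.items()}


def search(index, query):
    """Rank documents by a simple TF-IDF score; best match first."""
    total_docs = len(index)
    scores = Counter()
    for term in set(tokenize(query)):
        docs_with_term = [name for name, counts in index.items() if term in counts]
        if not docs_with_term:
            continue
        # Rarer terms weigh more; the +1 keeps the weight positive.
        idf = math.log(total_docs / len(docs_with_term)) + 1.0
        for name in docs_with_term:
            counts = index[name]
            tf = counts[term] / sum(counts.values())
            scores[name] += tf * idf
    return [name for name, _ in scores.most_common()]
```

A file that mentions a search word several times in a short text ranks above one that mentions it once in passing, which is roughly what you want when a customer asks for "everything on" a subject.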

If you don't want to install SWISH-E and you have Windows, try Google Desktop. It will also index file contents, but I don't know if it relevance ranks results.
posted by rachelpapers at 2:32 PM on August 19, 2007

