Please help me to figure out if any library catalog entries were lost during a file transfer from an old to a newer computer.
February 15, 2011

Attention please librarians, statisticians and other wonderful, helpful people. I'm doing a cataloging project in the archive for a small arts non-profit. I had been entering new books for several months into the catalog database, Athena. The very old computer that had been in use crashed without backup, but somehow revived. We (myself and two non-librarians) were able to transfer what seems like most of the files (floppy to flash in MARC) to another newer (but still old) computer, but there may still be some entries that were lost.

Another librarian I spoke with proposed that I conduct a random sample of the collection (books only) to try to determine if everything was transferred. I have no experience with random sampling other than a very brief overview in library school. I am not even sure where to start or if there may be a better solution. I don't think that there is a count of how many books are in the collection, and I don't know whether I even need this information. Any advice or references would be much appreciated! Thanks!
posted by dancingfruitbat
As an IT guy, the answer to "are these two data sets identical" is almost always "create a hash and see if they match". Are you dealing with one large MARC file, or many small ones? If one large one, you could always create an MD5 hash of each of the files. If the hashes are identical, the files are identical.

As a librarian, I'm betting that MarcEdit has a feature that will help, even if it just involves opening the file/files and checking the number of records in each.
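
If MarcEdit is not handy, a record count can also be sketched in a few lines of Python. This assumes the files are binary MARC (ISO 2709) exports, in which every record ends with the record terminator byte 0x1D; it would not work on MARCXML or a text mnemonic format. The function name is made up for illustration.

```python
RECORD_TERMINATOR = b"\x1d"  # end-of-record byte in binary MARC (ISO 2709)


def count_marc_records(path, chunk_size=65536):
    """Count records in a binary MARC file by counting record terminators."""
    count = 0
    with open(path, "rb") as handle:
        # The terminator is a single byte, so counting per chunk is safe.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            count += chunk.count(RECORD_TERMINATOR)
    return count
```

Running this over each transferred file and adding up the results gives a total to compare against whatever count Athena reports, or against any older count that turns up.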
posted by griffey at 7:01 PM on February 15, 2011

Hi, Mr. Dancing Fruitbat here. There aren't two datasets; the first one was lost when the computer died. The random sampling would be against the physical collection (I guess pick a book at random from the shelves and see if it's listed in the database?).
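
One way to make "pick a book at random from the shelves" concrete is sketched below in Python. It assumes you can count, or at least roughly estimate, how many books sit on each shelf; the shelf labels, the counts, and the function name are all hypothetical. Every book gets the same chance of being picked, so long shelves are not under-sampled.

```python
import random


def pick_random_books(shelf_counts, sample_size, seed=None):
    """Pick books uniformly at random from the whole collection.

    shelf_counts maps a shelf label to the number of books on that shelf.
    Returns a list of (shelf label, position from the left, starting at 1).
    """
    rng = random.Random(seed)
    total = sum(shelf_counts.values())
    if sample_size > total:
        raise ValueError("sample_size is larger than the collection")
    # Number every book 0..total-1 across all shelves, then sample without repeats.
    chosen = sorted(rng.sample(range(total), sample_size))
    picks = []
    for index in chosen:
        remaining = index
        for shelf, count in shelf_counts.items():
            if remaining < count:
                picks.append((shelf, remaining + 1))
                break
            remaining -= count
    return picks
```

For example, `pick_random_books({"Shelf A": 40, "Shelf B": 25}, 10)` returns ten shelf-and-position pairs to pull and look up in the database.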
posted by Deathalicious at 8:35 PM on February 15, 2011

To begin to answer the OP's question, this kind of statistical sampling is done all the time in industry, in situations like this: "How many pieces of this product should we test to be sure this batch is OK?"

To do it, you need to know
  1. the number of items in the batch (e.g., 10,000 light bulbs)
  2. your criterion that the batch is OK (e.g., less than 1% bad bulbs in the whole batch)
Then sampling charts will tell you how many pieces to test, and the maximum number of defective samples allowed for the batch to pass.
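
As a rough stand-in for those charts, here is a small Python sketch of the simplest case: how many books to check so that finding zero missing entries supports a given confidence level. It is a plain binomial calculation, not the charts themselves, and it ignores the batch size, which makes it slightly conservative for a small collection. The function name and the example numbers are assumptions for illustration.

```python
import math


def zero_defect_sample_size(max_missing_rate, confidence):
    """Smallest n such that, if the true missing rate were max_missing_rate,
    a random sample of n with zero misses would occur with probability
    at most 1 - confidence."""
    if not (0 < max_missing_rate < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    # Solve (1 - max_missing_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_missing_rate))
```

With a 1% criterion and 95% confidence this comes out to 299 books, all of which would have to be found in the database. Loosening the criterion to 5% brings it down to 59.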

But in your case, it seems likely to me that missing files (or missing parts of files) would be apparent as a corrupted database. Why don't you just open the rescued database and see if it looks and behaves normally? I think that should be your first step.
Disclaimer: IANALibrarian and I've never used Athena, but I've always loved libraries.
posted by exphysicist345 at 11:26 PM on February 15, 2011

I'm not sure what I might be missing here, but can you not load the files into Athena on a fresh machine and look at them that way? What file format are the files in now?
posted by Riverine at 4:48 PM on February 16, 2011

Sorry--forgot to post the second part: You really can't get a ballpark idea of loss by the count of the records in the new set? I do think a count is useful as a starting point, even if you don't have an exact number of the old.

Also, if you'll describe the elements of the data, I might be able to suggest a checking protocol.

I am a librarian with significant experience in managing and migrating large amounts of metadata.
posted by Riverine at 4:52 PM on February 16, 2011

