Scanning printed numbers into Excel
February 3, 2019 11:58 PM
I've been asked to copy ISBNs or similar identifying numbers from thousands of hard copy books into Excel. Is there an efficient way of doing this? I'm hoping for a technique that's easier than typing each number into a spreadsheet. For example, is there an app/device that will OCR selected text on a printed page (and exclude surrounding text)? Can the same app/device transfer the OCRed numbers into a simple text/CSV file, or directly into Excel?
Assume the books were published too long ago to include barcodes. (If only it were that easy!)
Assume that the numeric identifiers we are interested in are routinely buried amongst other small text. They may be prefaced with a variety of terms, including ISBN, SBN, 'catalogue number'. They may include dashes or spaces between digits.
Solutions should ideally be no/low cost. We might be able to fund a pen scanner, but only if it offers the best solution for this problem.
My employer can supply an iPad, plus a Windows laptop or desktop PC. I have an Android phone.
We have ABBYY FineReader, but it's only available on equipment on the opposite side of the building from where the books are stored, and it is in use for other projects. (In other words, this strays into 'typing the numbers will be much quicker' territory.)
Ultimately we will be using the ISBN identifiers to query against national and international book databases, so we can:
1) check if we already have copies of the same books listed in our catalogue
2) pull matching records into our catalogue, where required.
Many thanks.
OP says this needs to work without barcodes.
I’ve heard good things about Anyline but don’t have any experience using it myself.
posted by Jon_Evil at 12:29 AM on February 4, 2019
Best answer: If nothing else presents itself, I have found dictating ISBNs to be much faster and more accurate than typing. iOS has dictating functionality built in.
posted by bluebird at 1:26 AM on February 4, 2019 [10 favorites]
Seconding bluebird - dictation is an easy way to also read off the book title before the ISBN and add it to your file, in case you encounter multiple titles for one number (it shouldn't happen! but it does!) when you reach the querying part of your project.
posted by wheek wheek wheek at 3:11 AM on February 4, 2019
I would split this into two separate tasks: Capture & Processing.
Capture should be relatively simple. Simply take a photo of the page containing the text you want to process.
Processing is much more difficult, but it's relatively simple to find apps that will extract text from images. The hard part is picking which text you want from everything that is generated. For this you will need some kind of parser that extracts the text you want (note this parser could actually be a human, depending on the number of books and the variability of what you want to extract).
The reason for separating the tasks is that if you combine them, you will invariably find that you need to do the same thing for another piece of information in the future. In this fashion, you get to capture once and can reprocess the images to your heart's content in the future.
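A minimal sketch of the kind of parser described above, assuming the OCR step has already produced plain text for each photographed page. The pattern and the function name `extract_identifiers` are illustrative assumptions, not something proposed in this thread; it only looks for numbers labelled ISBN or SBN, and other labels such as 'catalogue number' would need to be added to the pattern.

```python
import re

# Hypothetical pattern: a label (ISBN, SBN, optionally "-10"/"-13"),
# optional punctuation, then digits mixed with spaces or hyphens,
# ending in a digit or the letter X.
LABELLED_NUMBER = re.compile(
    r"\b(?:ISBN|SBN)(?:-1[03])?[\s:.]*([0-9][0-9 \-]{7,20}[0-9Xx])",
    re.IGNORECASE,
)


def extract_identifiers(ocr_text):
    """Return labelled identifiers with spaces and hyphens stripped.

    Only candidates of 9 (SBN), 10 (ISBN-10) or 13 (ISBN-13) characters
    are kept. A number followed on the same line by further digits will
    be merged with them and rejected, so such pages need a human check.
    """
    found = []
    for match in LABELLED_NUMBER.finditer(ocr_text):
        compact = re.sub(r"[ \-]", "", match.group(1)).upper()
        if len(compact) in (9, 10, 13):
            found.append(compact)
    return found


page = "First published 1972\nISBN 0-306-40615-2\nPrinted in Great Britain"
print(extract_identifiers(page))
```

Because the photos are kept, the pattern can be loosened or tightened later and rerun over the same OCR text without touching the books again.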
posted by NoDef at 6:58 AM on February 4, 2019 [3 favorites]
If the Windows PC has OneNote, it does a decent job of OCR when the text is clear in the picture. I'd test it on about 20 books by taking pics with the iPad and Android phone and importing them into (two different) OneNote pages. Then you right click on each picture and pick 'Copy Text from Picture' and paste it next to the photo. See if either device provides better OCR results. Sometimes if the font is squishy, OneNote will confuse B with 3. ISBNs are all digits or the letter X, so you can spot errors quickly.
Once you have the text, I would not worry about cleaning up the extra text until you have pasted it into Excel. Then you can filter for all lines with your suspected terms until you have the same number of records as books.
I would have a process for what to do when you come across books with no SBN at all.
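Beyond spotting stray letters by eye, the last character of an ISBN is a check digit, so many OCR misreads can be caught automatically. The following is a rough sketch using the standard ISBN-10 and ISBN-13 check-digit rules; the function names are my own, and it assumes spaces and hyphens have already been stripped from the number.

```python
DIGITS = "0123456789"


def is_valid_isbn10(code):
    """ISBN-10 rule: weights 10 down to 1, sum divisible by 11.

    The final character may be X, which stands for the value 10.
    """
    if len(code) != 10:
        return False
    if any(c not in DIGITS for c in code[:9]) or code[9] not in DIGITS + "X":
        return False
    values = [int(c) for c in code[:9]]
    values.append(10 if code[9] == "X" else int(code[9]))
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(code):
    """ISBN-13 rule: alternating weights 1 and 3, sum divisible by 10."""
    if len(code) != 13 or any(c not in DIGITS for c in code):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(code))
    return total % 10 == 0


for candidate in ["0306406152", "03064061B2", "0306406153"]:
    print(candidate, is_valid_isbn10(candidate))
```

A number that fails the check is worth re-reading from the photo. A number that passes is not guaranteed to be right, since some misreads can still produce a valid check digit.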
posted by soelo at 8:11 AM on February 4, 2019
Random lateral question: do you have a title list of these books or any other inventory? The reason I ask is because the answer may be different if you are a bookstore (i.e. "we need to know what edition of all of these books we have") or a library ("we just need to know if it's hardcover or paperback"). There are websites that can take a list of titles and turn them into ISBNs (and vice versa - I use isbn.nu and librarything for this) but if you really need each ISBN for each book and you have to touch all the books, I think I'd be just dictating them as bluebird and others mention.
posted by jessamyn at 8:20 AM on February 4, 2019 [2 favorites]
Go over to the forums at LibraryThing and ask how the folks there process libraries of older books: they send out crews to visit historical sites and catalog all the books in a day, and I believe they have guides to doing so.
Short guide to "legacy libraries": https://www.librarything.com/legacylibraries
Discussion group: http://www.librarything.com/groups/iseedeadpeoplesbooks (Warning, v-e-r-r-r-y slow web server)
posted by wenestvedt at 9:05 AM on February 4, 2019 [3 favorites]
How about a USB OCR (and barcode) reader? http://www.jdldatasolutions.com/ocr-6500-wand-reader-usb/ (no affiliation, no experience with said device). Reminds me of 80's library character scanners.
posted by Muted Flugelhorn at 11:08 AM on February 4, 2019
Response by poster: Thanks for everyone's ideas so far. The dictation option is particularly interesting, as it sounds like a quick, affordable, and easy-to-learn process.
To answer Jessamyn's question, we are a library, dealing with a legacy of unprocessed collection donations. Unfortunately these donations rarely arrived with title lists/inventories. (When we do have electronic inventories, where possible we're developing processes to automate checking holdings and adding books to our catalogue, using Excel, OpenRefine, MarcEdit, coding etc.)
We're trying to step away from having staff manually undertake database searches for the title/author of every donated book in hand - and staff manually adjusting the catalogue records for each donated book we want to add to the collection. Rather we want to make use of existing data, and bulk processes, to automate some of this work. If we streamline this work, it will afford us more time to catalogue unique items that are also sitting in the donations backlog. (Win!)
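Since the books may carry 9-digit SBNs, ISBN-10s or ISBN-13s, a bulk matching process would probably want every identifier in one form before comparing it with catalogue or database records. Here is a small sketch of that normalisation, assuming identifiers have already been stripped of spaces and hyphens. The function name `to_isbn13` is hypothetical; the conversion itself follows the standard rules (an SBN becomes an ISBN-10 by adding a leading zero, and an ISBN-10 becomes an ISBN-13 by adding the 978 prefix and recomputing the check digit).

```python
def to_isbn13(code):
    """Convert a 9-digit SBN or a 10-character ISBN-10 to ISBN-13.

    A 13-character input is returned unchanged. The input is assumed
    to be free of spaces and hyphens; it is not checksum-validated here.
    """
    if len(code) == 9:
        code = "0" + code  # SBN to ISBN-10: add a leading zero
    if len(code) == 13:
        return code
    if len(code) != 10:
        raise ValueError(f"unexpected identifier length: {code!r}")
    core = "978" + code[:9]  # drop the old check digit, add the prefix
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


for raw in ["306406152", "0306406152", "9780306406157"]:
    print(raw, "->", to_isbn13(raw))
```

With everything in ISBN-13 form, duplicate checks against existing holdings become a plain equality test, for example in an Excel lookup or an OpenRefine join.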
posted by brushtailedphascogale at 3:40 PM on February 4, 2019
Out of curiosity, after reading this thread I fiddled around with this open source Android app: Character Recognition but couldn't get it to work. (Possibly because of the low resolution on my device's camera.)
posted by XMLicious at 7:32 PM on February 4, 2019
This thread is closed to new comments.
Most apps will output an exportable CSV file. Apps that read barcodes are a nickel a dozen.
posted by Sunburnt at 12:05 AM on February 4, 2019 [1 favorite]