OCR scanner recommendation?
June 13, 2023 5:58 AM

I need to scan a collection of about 5,000 loose papers to PDF with OCR. I've been using a Fujitsu ScanSnap s1300-i. It's great for scanning a couple things once in a while, but not for a job this size. Can anyone recommend a better device to tear through a huge stack of papers quickly?

Ideally it would be great to find something with well-designed software that isn't clunky.

I would use a scanning service, but I wish to retain custody.

Any help or info appreciated. Thanks!
posted by trevor_case to Technology (14 answers total) 1 user marked this as a favorite
 
Fujitsu used to make a flatbed ScanSnap with an ADF that held somewhere around 500 pages. Some years back a client of mine used it and the included software to digitize a couple of decades' worth of files. My recollection is that it wasn't terribly expensive, maybe $500.
posted by wierdo at 6:07 AM on June 13, 2023


How much money are you willing to spend? How consistently-sized are these papers? Fujitsu sells scanners with auto document feeders (ADF), but they're spendy, and that'll only help you if the papers are all consistent size and not crinkled up, so you can put a bunch of them in the feeder and just watch to make sure nothing jams.
posted by Alterscape at 6:10 AM on June 13, 2023


If you're comfortable with free software OCR tools (Tesseract is surprisingly good) and you aren't super concerned about the privacy of the scanned documents, you might consider going to a print shop that will let you scan them in with an office-grade scanner/feeder, taking the resulting PDF and doing the OCR step at home.
posted by mhoye at 6:31 AM on June 13, 2023 [1 favorite]


Best answer: Hi! I work for a company that sells scanners, mostly at industrial scale (businesses that are scanning hundreds or thousands of pages every day), but 5,000 pages is still a lot to scan.

We're a Canon scanner shop, so my knowledge is based on that; when you say 'loose papers', are they all letter- or legal-sized, or are they a variety of other sizes? Are they bigger than 8.5" on the short side?

If you have some money to spend: the DR-M260 is on the high end of what you need; it will eat through your 5,000 pages in a day. You can find them used on eBay -- make sure they have the power supply -- and new they run around $800. Compare that to a scanning service and see how it weighs out.

The Canon DR-C225 runs a bit cheaper, is a bit slower, around $400, but would do the job as well.

Both of these scan duplex, so front and back in one pass. They both come with software for scanning and OCRing the documents, and they also come with both TWAIN and ISIS drivers, so you can use any scanning software that supports those drivers.

From what I've seen in the competition, Fujitsu has similar scanners in the same price ranges as above -- I see the FI-7160 a lot which is comparable to the M260 -- I just don't know a lot about them.

But, the things to look for are:
  • document feeder that takes up to 50+ sheets at a time
  • duplex scanning
  • Page-Per-Minute number in the 20+ range (slower than that is painful)
  • scanner drivers that support using 3rd party software
Also, I have done something similar to mhoye -- I scanned straight to non-OCR'ed PDFs and wrote some batch files to add OCR to them with Tesseract, and got better results than the built-in software produced.

OCR likes higher DPI, so maybe plan on scanning at 600 dpi to ensure there are enough pixels for the OCR to recognize the letters. OCR depends on your computer's CPU -- the scanner doesn't do it -- so a slow computer can slow down the scanning process due to the amount of processing needed to OCR each page before it gets to the next one.

Edit: If you are pricing a service, they usually quote a price per page scanned; so if you do the math vs buying a scanner, 5,000 pages scanned on an $800 scanner is $0.16 per page, which is about what our service bureau charges for a scanned, proofed, and OCRed page.
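
As a rough sketch of that math (my own illustration, not a vendor calculator), the per-page cost, raw feed time, and number of feeder loads can be worked out as below. The function names are made up for this example, and the 20 ppm and 50-sheet figures are assumptions borrowed from the checklist above. The time estimate ignores jams, reloading, and OCR, so treat it as a lower bound.

```python
import math


def cost_per_page(scanner_price, pages):
    """Scanner purchase price spread over the number of pages scanned."""
    return scanner_price / pages


def feed_minutes(pages, pages_per_minute):
    """Raw feed time at the rated speed (no jams, no reloading, no OCR)."""
    return pages / pages_per_minute


def adf_loads(pages, adf_capacity):
    """How many times the document feeder has to be refilled."""
    return math.ceil(pages / adf_capacity)


# Figures from this thread: $800 scanner, 5,000 pages.
# Assumed for illustration: 20 ppm rated speed, 50-sheet feeder.
print(f"${cost_per_page(800, 5000):.2f} per page")        # $0.16 per page
print(f"{feed_minutes(5000, 20):.0f} minutes of feeding")  # 250 minutes of feeding
print(f"{adf_loads(5000, 50)} feeder loads")               # 100 feeder loads
```

Plugging in a quote from a service bureau or a different scanner price makes the comparison with buying fairly quick.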
posted by AzraelBrown at 6:54 AM on June 13, 2023 [8 favorites]


Best answer: Fujitsu sells scanners with auto document feeders (ADF), but they're spendy, and that'll only help you if the papers are all consistent size and not crinkled up, so you can put a bunch of them in the feeder and just watch to make sure nothing jams.

Came here to say this. I have a ScanSnap ix500 with a feed tray and it would be great for this. The newer model - ScanSnap ix1600 - is available for $420 on Amazon. The ADF holds 50 sheets at a time.
posted by NotMyselfRightNow at 7:23 AM on June 13, 2023 [1 favorite]


The question seriously depends on how much money you're willing to throw at this and how much time you have.

There are industrial-grade document scanners that can scan both sides in a single pass and go up to 110 pages PER MINUTE, with a 500-sheet feed tray, but that's like... 5000+ bucks (Canon DR-G2110).

If you plan to keep the scanner under $1,000, you'll get maybe 30 ppm, if you're lucky and it doesn't jam. :)
posted by kschang at 8:20 AM on June 13, 2023


If this is a one-time project, you may be able to rent a capable scanner for a week or so. Ricoh has a rental program, and I'm sure there are other companies that provide them (either manufacturers or office supply companies). No idea how much it would cost, but worth looking into.
posted by brianogilvie at 9:39 AM on June 13, 2023 [2 favorites]


Seconding the idea of renting an industrial grade scanner. I own a Fujitsu ix500 and really like it, but the thought of scanning 5,000 pages on it is very daunting.
posted by hovey at 11:11 AM on June 13, 2023 [1 favorite]


I'm going to nth the Fujitsu ScanSnap ix500 (or maybe the newer ix1600, but I have the 500). I've found the OCR software to be pretty good if the font is fairly straightforward; I haven't tried unusual stuff. I'm sure it will do 5,000 pages, but your limitation will be how many person-hours you have per day to assign to loading and scanning documents.

I would, myself, recommend setting it so that it stops scanning automatically when the ADF is empty, and setting the application to 'scan to folder', where it will automatically bring up the 'save' menu. Not sure if you can get it to just assign a file name and skip that part (I don't remember right now). That way you can 'batch' the documents into piles of 'n' sheets and get a PDF of those pages saved. You can always split or combine them with Acrobat later.

As far as retaining custody goes, maybe you could put out a job post for someone with a scanner to come to your location to do the job: install the scanner and software, scan the docs, make a backup copy of the files, uninstall the software.

No matter how you do it - what is your plan for reviewing for OCR errors and correcting them?
posted by TimHare at 1:14 PM on June 13, 2023


wierdo: Fujitsu used to make a flatbed ScanSnap with an ADF that held somewhere around 500 pages.

Years ago we got hold of a ginormous pile of prehistoric computer documents, in several paper sizes: letter, A4, A5, and more. Books, binders, sales folders, handwritten loose sheets, you name it. Because of the rarity we were compelled to scan them and upload them to archive.org. We started with 2, then 4 Fujitsu fi-6240 scanners with ADF.

Of course because of the age of the papers and the fact that the ten, fifteen preceding years didn't quite qualify as 'archival environment' (a.k.a. dry-ish cellar) there were jams and mis-scans as pages stuck together. But in all they just did the job with remarkably little effort past the loading of several thousand batches of pages.
posted by Stoneshop at 1:49 PM on June 13, 2023 [1 favorite]


Best answer: If you're comfortable with the command line (w/Linux/Mac/Win GUI options), Tesseract can zip through a directory of images. So, scan everything (ADF would really help!) into a directory (or PDF) and set Tesseract on it. This would be trading time for complexity, since you won't have to wait for OCR as part of the scanning process, but y'know...complexity.
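
For what it's worth, here is a minimal sketch of the "set Tesseract on a directory" step. It assumes the `tesseract` command-line tool is installed and on your PATH, and that the scanner saved one image file per page. The function names and the folder layout are hypothetical, just for illustration.

```python
import subprocess
from pathlib import Path

# Image types to pick up; adjust to whatever your scanner software writes.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def tesseract_command(image, out_dir, lang="eng"):
    """Build the command that turns one page image into a searchable PDF.

    Tesseract takes an output *base* name and adds ".pdf" itself
    when the "pdf" config is given.
    """
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_directory(scan_dir, out_dir, lang="eng"):
    """OCR every image in scan_dir, one PDF per page image, in name order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for image in sorted(Path(scan_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # check=True stops the batch on the first page Tesseract rejects.
        subprocess.run(tesseract_command(image, out_dir, lang), check=True)
        produced.append(out_dir / f"{image.stem}.pdf")
    return produced
```

Calling `ocr_directory("scans", "ocr")` would then walk a `scans` folder and write the PDFs into `ocr`. Because the OCR runs after scanning is finished, you can leave it going overnight on a slow machine.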
posted by rhizome at 2:18 PM on June 13, 2023


I separated out the scanning from OCR recently. I had loads of old documents and eventually settled on a paperless-ngx server on a decade-old Mac mini. It does the OCR and loads more besides (tagging was a killer feature for me).

https://docs.paperless-ngx.com/#why-this-exists
https://github.com/paperless-ngx/paperless-ngx

Try it here: demo.paperless-ngx.com, using login demo / demo

The docker image is very easy to install and it allows you to just drag your docs onto it to scan (or ask it to hoover up a folder). I like it mostly because it then acts as a search engine for the PDFs on my local home network.

You still have to scan the docs obviously. Have a noodle at the demo.
posted by Vroom_Vroom_Vroom at 12:02 AM on June 14, 2023


> I would use a scanning service, but I wish to retain custody.

Some shops that offer a scanning service can do it while you wait, with some notice.
posted by yclipse at 4:41 AM on June 14, 2023


Last year, I scanned a book, roughly a thousand pages long. I documented my process, but some other details:
* scanning: my public library has a self-serve scanner with an automated feeder, and I think it only took an hour or two. I brought in a USB thumb drive. You might also check nearby universities.
* I also looked at getting a scanner from Craigslist; I'd suggest the keywords of scanner and adf or feeder.

I'm on a Mac, which influenced my software choices. The OCR options of interest to me:
* OCRmyPDF uses Tesseract, and puts the text in as another invisible layer, over the scan. So, it preserved the look, and also, searching for text works.
* ocrit uses Apple's Vision framework. It was higher-quality, though it's Mac-only software. This emits text, separate from the pdfs.
* both are free.
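
A minimal sketch of the OCRmyPDF route, assuming the `ocrmypdf` command is installed and on your PATH. The helper names and file names are hypothetical. `--skip-text` tells it to leave alone any pages that already have a text layer, which helps if some files were OCRed by the scanner software.

```python
import subprocess


def ocrmypdf_command(src_pdf, dst_pdf, lang="eng"):
    """Build the command that adds an invisible text layer to a scanned PDF."""
    return ["ocrmypdf", "-l", lang, "--skip-text", str(src_pdf), str(dst_pdf)]


def add_text_layer(src_pdf, dst_pdf, lang="eng"):
    """Run OCRmyPDF on one file; raises CalledProcessError if it fails."""
    subprocess.run(ocrmypdf_command(src_pdf, dst_pdf, lang), check=True)
```

For example, `add_text_layer("batch_01.pdf", "batch_01_ocr.pdf")` would write a searchable copy next to the original and leave the original scan untouched.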
posted by Pronoiac at 3:02 AM on June 15, 2023

