Where are the good Windows-compatible PDF voice readers in 2024?
August 30, 2024 9:42 AM Subscribe
Ms flabdablet is supporting a student who uses voice reading to make documents more digestible. They're currently using the reader built into MS Edge to do this for PDF documents, but it talks gibberish when reading PDFs that have had violence done to their text flow by the printer driver that generated them. Do voice readers exist that can read out a PDF as rendered on the page, regardless of how badly chopped-up the text is inside the actual file? Are any of them free?
I have experience with this issue from a slightly different angle but I can apply a lot of that knowledge to this question. Which hopefully doesn't mean I overload you with this answer.
I think the question you're asking is a little off the mark, in the sense that 'voice readers that read a PDF as rendered' is not really what you should be looking for. What you should be looking for is a third thing that can turn the bad PDFs into something more suitable for a voice reader.
So the problem with PDFs is that there are many ways to skin the proverbial cat. You can construct them in many ways humans would see as having the same outcome, but the machine process to produce that output is wildly different. For example, I can produce a PDF you can read by embedding a screenshot of some text in the PDF, but a machine reading it would just see a lone image (not the text within it, since to the machine there is no text, just an array of pixels). Never mind the ambiguity of things like columns of text: where you or I might perceive, say, a 2-column page, a machine might just see a bunch of lines with weird spacing. So that this text:
The quick brown fox jumps | Lorem ipsum dolor sit amet
over the lazy dog.        | consecteteur adipiscing elit.

Comes out of a machine text converter as:

The quick brown fox jumps Lorem ipsum dolor sit amet
over the lazy dog. consecteteur adipiscing elit.
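To make the column problem a bit more concrete, here is a toy Python sketch. It assumes each piece of text arrives with an (x, y) position, with y growing downward, and that the gap between columns sits at a known x value. The coordinates and the gutter_x parameter are made up for illustration; real extractors have to guess at both.

```python
# Toy fragments: (x, y, text). Positions are invented for illustration,
# with y increasing down the page (real PDF coordinates run the other way).
FRAGMENTS = [
    (0, 0, "The quick brown fox jumps"),
    (300, 0, "Lorem ipsum dolor sit amet"),
    (0, 12, "over the lazy dog."),
    (300, 12, "consecteteur adipiscing elit."),
]


def row_order(fragments):
    """Read straight across each line, which is what garbles two columns."""
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


def column_order(fragments, gutter_x):
    """Read everything left of gutter_x top to bottom, then everything right of it."""
    ordered = sorted(fragments, key=lambda f: (f[0] >= gutter_x, f[1], f[0]))
    return [text for _x, _y, text in ordered]


print(" ".join(row_order(FRAGMENTS)))
print(" ".join(column_order(FRAGMENTS, gutter_x=150)))
```

The first print reproduces the garbled reading order; the second reads the left column to the end before starting the right one.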
Anyhow, what to do is going to depend on the nature of your bad PDF:
—
1) The PDF has pictures of text instead of text. You want to use Optical Character Recognition (OCR) to produce text from the images. Adobe Acrobat can do this. You can also use a tool like ocrmypdf to try to do this. These processes are not perfect so expect only partial success.
2) Something else is up with the PDF. For example, it's confusing the voice reader because of something like the column problem I describe above. In this case you probably want to convert the PDF to plain text and try to fix that, then send the result through the voice reader. There are tools like pdftotext, from a software package called poppler-utils, that can do this. There are also many online tools (e.g., this one from sejda; try searching for "pdf crop text conversion"), though those will not always be as configurable and infinitely reusable as a command-line tool like pdftotext. What you want to do is "crop" the input, so you could first extract the left-hand column, then the right-hand column. Interleave those two outputs and you get something that comes out in the correct reader order. In other words:

A | B   <- page #1
C | D   <- page #2

Becomes...

A C   <- pass #1 produces this list
B D   <- pass #2 produces this one

After interleaving these two passes you then get ABCD, the correct reader order. Sadly, if not every page of your PDF is 2-column, for example, this presents yet another hurdle.
—
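For anyone comfortable with a little scripting, the two cases above could be sketched roughly like this in Python. This is a minimal sketch, not a tested pipeline. It assumes ocrmypdf and pdftotext are installed and on the PATH, that every page is US Letter size (612 x 792 points) with the column gap exactly in the middle, and that pdftotext separates pages with a form feed character. The function names and default page size are my own; measure your actual pages before trusting the numbers.

```python
import subprocess


def ocr_pdf(pdf_in, pdf_out):
    """Case 1: add a text layer to an image-only PDF with ocrmypdf."""
    subprocess.run(["ocrmypdf", pdf_in, pdf_out], check=True)


def extract_region(pdf_path, x, y, width, height):
    """Case 2: run pdftotext on one cropped rectangle, returning one string per page.

    -x/-y/-W/-H give the crop box in pixels at the default 72 dpi,
    which lines up with PDF points. "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-x", str(x), "-y", str(y),
         "-W", str(width), "-H", str(height), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    pages = result.stdout.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def interleave(left_pages, right_pages):
    """Turn [A, C] and [B, D] into [A, B, C, D]."""
    ordered = []
    for left, right in zip(left_pages, right_pages):
        ordered.append(left)
        ordered.append(right)
    return ordered


def two_column_text(pdf_path, page_width=612, page_height=792):
    """Extract left and right halves separately, then put them in reading order."""
    half = page_width // 2
    left = extract_region(pdf_path, 0, 0, half, page_height)
    right = extract_region(pdf_path, half, 0, page_width - half, page_height)
    return "\n".join(interleave(left, right))
```

Calling two_column_text("problem.pdf") (a hypothetical file name) would give you plain text you can paste into or open with the voice reader. It will still go wrong on any page that isn't two-column.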
Unfortunately, the perfect solution to this problem doesn't really exist, and it may turn out to be very work-intensive to resort to individually inspecting "bad" PDFs and determining a course of action then executing it. Not to mention that even if you were willing to individually address the bad ones, what you may actually be thinking is "what the hell is a command line tool I'm scared of that" so maybe this would already be way outside your comfort zone.
How many PDFs are we talking, and how long are they? A handful with fairly well-behaved (i.e., predictable/regular) badness is probably doable. Feel free to memail me if you want more help, I may even be able to just bash out a couple of quick solutions with command line tools if you only need a handful of fairly short/well-behaved files handled.
posted by axiom at 4:19 PM on August 30 [2 favorites]
My husband has a thing that reads his law school textbooks to him that he really likes, so I sent him a link to this question and he texted back:
"I use Balabolka. It opens pdfs, docs & epub files as plain text and then reads the files to you"
posted by Jacqueline at 5:45 PM on August 30 [3 favorites]
I was going to offer to MeMail you my husband's email address if you wanted him to test one of the problem PDFs for you before you bought it, but it looks like it's actually freeware:
https://www.cross-plus-a.com/balabolka.htm
I suggest you try installing it on your home machine and have your wife send you one of the PDFs that is garbled in Microsoft Edge, so that you can confirm that Balabolka can read it correctly before installing on the student's machine and potentially frustrating the student with yet another thing that doesn't work.
There's a portable version that can be run from a USB drive, which is handy because many school districts set up the student machines to not allow anything new to be installed, and the approval process to get an exception can be a time-consuming hassle.
Also, if the student's machine is a tablet that doesn't have a regular rectangular USB port, here's a USB Type-C flashdrive for under $10 that should work in the power port.
posted by Jacqueline at 6:02 PM on August 30
Many years ago when I worked at Adobe, they had a basic demo of text to speech that ran on Windows only. The synthesized voices sounded like garbage, so I wrote a version for the Mac to use the better voices. Here's what made that work reasonably well: Adobe invested a fair amount of research into accurate text extraction through an algorithm with the internal name Wordy. It understands a fair amount about reading order for different locales and does pretty well on non-rectilinear text (think maps). In later years, Adobe added discretionary reading order hints, which solve the columnar challenges, but the hints either have to be put in by a driver or put in by hand.
I don't know how many other PDF applications use/honor this.
I will take a moment to explain just why text extraction is so challenging in PDF. The first problem is that page content is set up with a little language that kind of looks like this:
BeginText
SetTextTransform(someMatrix)
SetTextFont(PageResources.Font5)
JustifyText("some te")
JustifyText("xt here.")
EndText
There are a number of innocuous things that represent possibly very deep things going on. The first is that you notice that the text isn't laid out in one go. This depends very much on both the driver and the generation software. troff was notorious for doing things like laying all the plain text, then all the italic text, then all the bold text. Font5 is what exactly? Well, we don't know from here. It could be any one of several different classes of fonts all of which include the ability to re-encode the font so that the letters in the string are actually something else entirely. You don't know. I know the guy who wrote Wordy and he did an astounding job with the tools that he had.
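A toy Python model of those two problems, fragmented strings and font re-encoding, might look like the following. Everything here is invented for illustration: the operation names follow the pseudo-language above rather than real PDF syntax, and the "shift every character code by one" encoding for Font5 is a made-up stand-in for whatever a real font might do.

```python
# A pretend page: the stored strings are NOT what a human sees on screen,
# because the (hypothetical) Font5 encoding shifts every character code by one.
OPS = [
    ("BeginText",),
    ("SetTextFont", "Font5"),
    ("JustifyText", "tpnf!uf"),
    ("JustifyText", "yu!ifsf/"),
    ("EndText",),
]

# Made-up font table: maps a font name to a function that decodes its strings.
FONT_DECODERS = {
    "Font5": lambda raw: "".join(chr(ord(ch) - 1) for ch in raw),
}


def naive_extract(ops):
    """Glue the raw strings together, ignoring the font entirely."""
    return "".join(args[0] for op, *args in ops if op == "JustifyText")


def font_aware_extract(ops):
    """Track the current font and decode each fragment through it before joining."""
    decode = lambda raw: raw  # until a font is set, pass text through unchanged
    pieces = []
    for op, *args in ops:
        if op == "SetTextFont":
            decode = FONT_DECODERS[args[0]]
        elif op == "JustifyText":
            pieces.append(decode(args[0]))
    return "".join(pieces)


print(naive_extract(OPS))       # gibberish
print(font_aware_extract(OPS))  # readable text
```

An extractor that skips the font step gets gibberish, and one that handles it still has to work out for itself that the two fragments belong to the same run of words.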
posted by plinth at 7:00 AM on September 3 [1 favorite]
posted by kschang at 11:55 AM on August 30