(Translation) How to count Japanese source characters in a PDF?
September 5, 2017 1:15 PM

New-ish translator here! I am trying to count the source characters in a Japanese PDF in order to come up with an estimate for a new job. I have tried a variety of methods and programs (Word, CountAnything, JCount, MemoQ, etc.) and have even tried converting the file into other formats like PPT/Word, yet the counts differ significantly from one tool to the next. I am assuming that this is due to some rookie mistake of mine - what could I be doing wrong?

Thank you in advance. The PDF is editable by the way, so it is not a matter of the document being locked or anything. I am assuming that I am just doing something wrong since I am new and haven't worked with PDFs before.

PS: If anyone feels they need to see the document in order to help, please let me know - my boss told me that it is a 'public item', so sending it privately would not break any confidentiality clauses.
posted by CottonCandyCapers to Work & Money (4 answers total)
 
I usually just copy-paste into a plain text editor and use the character count it gives.

What % of discrepancy are you getting? Is there any problem with just choosing the highest result (as long as it isn't an extreme outlier)?
posted by that girl at 10:50 AM on September 6, 2017


It's very likely that (some of) the programs are counting wrong, and you're doing everything right; Unicode is tricky under the best circumstances, and it's very easy to program an incorrect algorithm for counting graphemes/characters.

I'd recommend making a short test document with a known number of characters in it, running it through your programs, and then using the one that has the (most) accurate count.
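To make the "known test document" idea concrete, here is a minimal Python sketch. The sample string and the function name `count_report` are made up for illustration; none of this comes from any of the tools named in the question. It counts the same short text several different ways, which is one plausible reason two programs could report different totals for an identical document.

```python
import unicodedata

# Hypothetical test text with a known makeup: 日本語 + ideographic space +
# テスト + a decomposed が (か followed by a combining dakuten) + the
# non-BMP kanji 𠮷 (U+20BB7).
SAMPLE = "日本語\u3000テスト" + "か\u3099" + "\U00020BB7"


def count_report(text: str) -> dict:
    """Count the same text under several common definitions of 'character'."""
    nfc = unicodedata.normalize("NFC", text)
    return {
        # Unicode code points, as Python's len() sees them
        "code_points": len(text),
        # Code points after composing base + combining mark pairs
        "code_points_nfc": len(nfc),
        # UTF-16 code units (what some tools and languages count)
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Raw UTF-8 bytes (what a naive byte counter would report)
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points excluding whitespace, including the ideographic space
        "non_whitespace": sum(1 for ch in text if not ch.isspace()),
    }


if __name__ == "__main__":
    for name, value in count_report(SAMPLE).items():
        print(f"{name}: {value}")
```

Pasting a string like this into each program and comparing its number against these counts should show which definition, if any, that program follows.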
posted by Aleyn at 1:26 PM on September 6, 2017


Are there alphanumeric characters in the PDF? This thread suggests that different clients (not sure about programs) may treat alphanumeric characters differently. Another thread suggests that if you haven't checked the option to include text within text boxes in Word, your word count will be an underestimate.
posted by katecholamine at 11:13 PM on September 6, 2017


So I got the PDF from CottonCandyCapers and used pdftotext from the Debian poppler-utils package to extract the text. Then ran it through some simple Perl for some quick counts. Mostly came up with the same numbers CottonCandyCapers and others came up with.
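A rough Python sketch of that kind of quick count follows. It is not the Perl mentioned above, which isn't shown here. It assumes the text has already been extracted with something like `pdftotext -enc UTF-8 input.pdf output.txt`, and that page breaks show up as form-feed characters in the output. The function name `count_extracted_text` and the sample string are hypothetical.

```python
def count_extracted_text(text: str) -> dict:
    """Quick counts over text extracted from a PDF.

    Assumes pages are separated by form feeds ("\f"). If the extraction
    tool does not emit them, the page count will simply come out as 1.
    """
    pages = [page for page in text.split("\f") if page.strip()]
    return {
        "pages": len(pages),
        # Every code point, including newlines, spaces and form feeds
        "all_chars": len(text),
        # Code points that are not whitespace of any kind
        "visible_chars": sum(1 for ch in text if not ch.isspace()),
    }


if __name__ == "__main__":
    # Stand-in for the contents of output.txt; with a real file you would
    # read it with open("output.txt", encoding="utf-8").read() instead.
    sample = "見積書\n第1章\f本文 テスト\n\f"
    print(count_extracted_text(sample))
```

The gap between `all_chars` and `visible_chars` is the sort of thing that can differ between tools, depending on whether they count line breaks and spaces left behind by the conversion.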

Just glossing over the data, I'd guess the minor discrepancies boil down to: a) the conversion to text (my pdftotext output wasn't 100% exact, but close enough; getting the text out of Adobe tools directly might be a bit better), and b) the method used to determine just what type of character each character is.

Any sufficiently long, complex Japanese text of this sort has lots of weird stuff like full/half-width roman characters, numbers and spaces, circled digits, and a plethora of open/close quote markers. Counts depend on exactly how you want to classify those: as punctuation/digits, as Asian characters, or whatnot.
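To illustrate how much the classification choice matters, here is a small Python sketch that buckets characters by Unicode block and then applies two different counting policies. The bucket names, the code point ranges chosen, and the two policies are assumptions for illustration only; they are not how any particular counting tool is known to work.

```python
from collections import Counter


def classify(ch: str) -> str:
    """Assign one character to a rough bucket based on its code point."""
    cp = ord(ch)
    if ch.isspace():
        return "space"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"  # full-width block plus half-width katakana
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"  # main CJK ideograph block plus Extension A
    if ch.isascii() and ch.isalnum():
        return "ascii_alnum"
    if (0xFF10 <= cp <= 0xFF19 or 0xFF21 <= cp <= 0xFF3A
            or 0xFF41 <= cp <= 0xFF5A):
        return "fullwidth_alnum"
    return "other"  # quote brackets, circled digits, punctuation, symbols


def tally(text: str) -> Counter:
    """Count how many characters fall into each bucket."""
    return Counter(classify(ch) for ch in text)


def count_with_policy(text: str, counted: set) -> int:
    """Count only the characters whose bucket is in `counted`."""
    return sum(1 for ch in text if classify(ch) in counted)


# Hypothetical sample: kanji, a full-width digit, corner brackets,
# katakana, and a circled digit.
SAMPLE = "第１回「テスト」①"

KANA_KANJI_ONLY = {"hiragana", "katakana", "kanji"}
EVERYTHING_BUT_SPACE = KANA_KANJI_ONLY | {"ascii_alnum", "fullwidth_alnum", "other"}

if __name__ == "__main__":
    print(tally(SAMPLE))
    print(count_with_policy(SAMPLE, KANA_KANJI_ONLY))
    print(count_with_policy(SAMPLE, EVERYTHING_BUT_SPACE))
```

The same nine-character string yields a noticeably smaller number under the stricter policy, so two tools can both be "right" and still disagree.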
posted by zengargoyle at 6:17 AM on September 7, 2017

