PDF copy/paste repeats text - help?
August 10, 2012 6:03 AM

When I copy/paste text from this PDF, the copied text is repeated and sometimes includes additional text. Why is this and how can I fix it?

This is someone's dissertation, so I can't provide the file. I'm trying to gather some information before contacting the student for trouble-shooting.

For instance, if I try to copy and paste the first line (assume it is "This multi-colored cat likes to eat fish and sometimes sleep"), it pastes as this:

This multi This multi This multi -colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored

If I try to copy/paste just the first few words "This multi-colored cat", it pastes as:

This multi This multi This multi -colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep colored cat likes to eat fish and sometimes sleep

When I select the text to paste, only some of the letters show up as highlighted, unlike on a normal/well-behaved PDF, where they all look highlighted.

Some other sections of the same PDF don't have this weird behavior. How can I fix this? Does it have anything to do with embedded fonts? I can have the student regenerate the PDF, but I would like to have a better idea of what causes this and how to fix it before I contact them.
posted by fussbudget to Computers & Internet (8 answers total)
 
Without the actual file it's a bit hard to answer.

You don't mention which OS you're using, which program you're copying from, which program you're pasting into, what was used to create the PDF, or whether you get the same behavior on a different PC or Mac. There are just too many variables.

What I would do is try opening it in something other than Reader (Preview on a Mac, for example).

If you get to a point where you can't do anything with the file, I could run it through some pro-level tools.
posted by cjorgensen at 6:10 AM on August 10, 2012


Acrobat Reader has an export to text file function. Alternatively, try pdftotext.

The graphical representation in a PDF can be completely separate from the text you can copy. Certain packages foul this up magnificently, and only a very few do it on purpose. Sometimes your only option is to print and scan back using OCR.
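
If you want to script the pdftotext route, here is a minimal Python sketch. It assumes the `pdftotext` command-line tool (from Xpdf or Poppler) is installed and on your PATH; the function names and the file name `thesis.pdf` are made up for illustration. Note that pdftotext reads the same text layer that copy/paste does, so it may well show the same repetition. That is still useful, because it tells you whether the problem is in the file or in the viewer.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None,
                        layout: bool = False) -> List[str]:
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    if layout:
        cmd.append("-layout")            # try to keep the physical layout
    cmd += [pdf_path, "-"]               # "-" writes the text to stdout
    return cmd


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return the text layer as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **options),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `extract_text("thesis.pdf", first_page=1, last_page=1)` would then give you the text layer of the first page to compare against what you see on screen.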
posted by scruss at 6:24 AM on August 10, 2012


I really can't provide a link to the file. I know that makes it difficult! I appreciate you all trying to help me even with this limitation.

I just tried opening this in Adobe Reader on an iPad and the copy/paste seems to work fine from there. I was originally using Acrobat Pro Version 9.5.1 on a Windows 7 PC. The other folks who have looked at this are also using PCs, and probably Adobe Reader. I am not sure what my colleagues are using, but I can probably find out if that would help.

Maybe the student created the file using a Mac, and the PDF conversion method they used results in a file that doesn't read correctly on a PC?
posted by fussbudget at 6:29 AM on August 10, 2012


Sorry - I asked for the file, stupidly.

One thing I'd suggest is asking the student to submit it with a vanilla font, something basic like Arial or Times New Roman. I'd also get them to check whether they have any text effects turned on. And finally, make sure they aren't creating the PDF while in review mode, i.e. have them accept all tracked changes in the file first.
posted by MuffinMan at 6:36 AM on August 10, 2012


If you go to File -> Properties, you can see what they used to generate the file ('PDF Producer'), which might give you some other clues.
posted by yeahyeahyeahwhoo at 8:33 AM on August 10, 2012


I have found that columns and blocks copy weirdly from PDFs. You would think that a column of text would form a single block from top to bottom, but not always. A PDF will give you half of that column and then jump into the caption of a picture, or another block of text. I have also found text boxes that did not appear in the visible document but popped up in the copy-pasted text.

Like MuffinMan said, perhaps the PDF has saved version on top of version in the same file.
posted by ohshenandoah at 11:13 AM on August 10, 2012


Have you tried (is it possible to try) exporting it as HTML? I've found that's a way to get the text out of PDFs that sometimes works when other methods don't.

The example text you give is oddly mesmerizing. I find myself wanting to hear it sung as lyrics to a Philip Glass piece, or something.
posted by Lexica at 6:28 PM on August 10, 2012


Thank you for all the ideas! I tried running OCR on the file in Acrobat Pro, and that seemed to solve the problem. It gave me an error "Acrobat could not perform recognition (OCR) on this page because: This page contains renderable text." on almost every page while the OCR was running, but now text selection and copy/paste are working normally.

This isn't a terribly satisfying resolution because we still don't know what caused the problem in the first place, but it's better in this case to be satisfied with the working file than to have to go back and forth with the student to try to identify and fix the root cause. If I start seeing this problem often, I'll dig deeper.

In case it helps people in the future with a problem like this, the PDF Producer in this case was Microsoft Word 2010. (Thanks for that tip, yeahyeahyeahwhoo.)
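
If this does start turning up often, one rough way to flag affected text automatically is to measure how many word sequences repeat. The sketch below is only a heuristic of my own, not anything Acrobat or Word provides: the function name and the choice of 4-word sequences are assumptions, and any threshold you pick would need tuning on real files. It takes text that has already been copied or extracted from the PDF.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of n-word sequences in `text` that occur more than once.

    Ordinary prose scores close to 0; text pasted from a PDF with the
    repetition problem described in this thread scores much higher.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    # Every run of n consecutive words, in order.
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    # Count every occurrence of any sequence seen at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)
```

Running this on the clean example sentence gives 0.0, while the repeated paste from the question scores well above 0.5, so a cutoff somewhere in between might be a reasonable starting point for flagging pages to look at by hand.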
posted by fussbudget at 6:11 AM on August 13, 2012

