How to scrub a .pdf?
January 20, 2009 11:05 AM
Acrobat Pro filter: How can I remove unwanted highlighter from a color scanned pdf?
I just scanned a chapter from a library book which was annoyingly splattered with haphazard highlighter.
Using Acrobat Pro, what is the best way for me to get rid of this as I make the document black and white without it turning into distracting grayscale?
Yeah, I don't recall there being any image-editing functionality in Acrobat Pro. Even though your scan is saved as a PDF, it's really just an image (probably a JPEG). Something like Photoshop or Photoshop Elements would be able to eliminate the highlighting.
posted by Thorzdad at 12:18 PM on January 20, 2009
Response by poster: Thanks! That makes sense. I'll try to do it tonight using photoshop. So, just to confirm, there's no way to somehow use Advanced>Print Production>Convert Colors to do it? All I want to do is get rid of every bit of the purple and yellow while leaving black intact.
posted by umbú at 12:27 PM on January 20, 2009
If you're doing it in Photoshop, try using the Channel Mixer, check 'Monochrome', and use 100% of the green channel. I know it works for yellow highlighter, but I'm not so sure about the purple. I suspect it exists in the red and blue channels but not the green.
If it works, it would be very easy to apply to a lot of files.
posted by echo target at 1:02 PM on January 20, 2009
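A rough sketch of what the green-channel suggestion does to individual pixels, in plain Python with no imaging library. The sample colors are assumptions (real highlighter ink will scan differently), and `brightest_channel_mono` is a hypothetical alternative that nobody in the thread proposed; it is included only to show one way the purple case might be handled.

```python
# Pixels are (R, G, B) tuples with values 0-255.

def green_channel_mono(pixels):
    """Monochrome from 100% of the green channel, as in the Channel Mixer tip."""
    return [g for (_r, g, _b) in pixels]

def brightest_channel_mono(pixels):
    """Hypothetical alternative: keep the brightest channel of each pixel."""
    return [max(p) for p in pixels]

# Assumed, idealized sample colors.
white_paper = (255, 255, 255)
black_text = (20, 20, 20)
yellow_marker = (255, 255, 0)
purple_marker = (200, 120, 220)

samples = [white_paper, black_text, yellow_marker, purple_marker]
print(green_channel_mono(samples))      # [255, 20, 255, 120]
print(brightest_channel_mono(samples))  # [255, 20, 255, 220]
```

With these assumed values, yellow becomes pure white in the green channel while purple stays a mid gray, which matches the suspicion above that purple lives mostly in the red and blue channels.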
So, just to confirm, there's no way to somehow use Advanced>Print Production>Convert Colors to do it?
I'm pretty sure that just handles simple things like RGB-to-CMYK conversions for offset printing. Nothing as involved as spot removal of details in the image.
posted by Thorzdad at 1:52 PM on January 20, 2009
So, just to confirm, there's no way to somehow use Advanced>Print Production>Convert Colors to do it?
Thorzdad is correct. The problem is that, even though you're working with a PDF, the scanned document is essentially one big image. Acrobat (especially the newer versions) has some awesome document-editing capabilities, but hardly anything in the realm of raster image editing. If this were a file that someone set up in Quark with background boxes underneath designed to approximate the appearance of highlighter, you'd have a number of options open to you. In this case, not so much. Seconding what others said about importing the file into Photoshop and getting rid of it there. Try this:
From Photoshop, select all of the text (for uniformity's sake, select the text that doesn't have highlighting on it too). Now, go to Curves (Image > Adjustments > Curves). In this instance, ignore the diagonal line you see in the grid that pops up, but see the dot in the lower left-hand corner? Grab that and start dragging to the right (don't go up) until the highlighter just disappears. That should get rid of the highlighter, but the text will look a bit faint around the edges.
To fix this, move your cursor to the dot in the upper right-hand corner of the grid. Pull it to the left until the text just looks dark again (don't go too far or it'll start to look blotchy). That's it.
I will say that this trick is heavily contingent on the quality of the text you're working with. If the quality wasn't great to begin with, this probably won't make it a whole lot better (but then again, the annoying highlighting will be gone). If it was scanned at a high enough resolution, though, this should do the trick.
posted by kryptondog at 2:16 PM on January 20, 2009
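The two Curves moves described above amount to clipping both ends of the tonal range: light tones (the highlighter) are pushed to pure white, dark tones (the text) to pure black, and whatever lies between is stretched. A minimal sketch on 0-255 gray values, where `white_point` and `black_point` are hypothetical thresholds standing in for "drag until the highlighter just disappears" and "pull until the text looks dark again". Which corner of the Curves grid maps to which end depends on how the dialog is set to display tones, so treat this as the idea rather than a copy of Photoshop's behavior.

```python
def clip_levels(gray, white_point=180, black_point=80):
    """Clip a 0-255 gray value: light tones to white, dark tones to black.

    white_point and black_point are assumed values; tune them by eye per scan.
    """
    if gray >= white_point:
        return 255  # highlighter and paper both become white
    if gray <= black_point:
        return 0    # text is pushed to solid black
    # Stretch the remaining midtones linearly across the full range.
    return (gray - black_point) * 255 // (white_point - black_point)

# Light highlighter tone, a soft text edge, and dark text.
print([clip_levels(g) for g in (230, 130, 40)])  # [255, 127, 0]
```

Setting the two thresholds too close together leaves almost no midtones, which is roughly the "blotchy" look the warning above is about.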
If you just need the raw text, you could use the OCR functionality of Acrobat Pro, then extract the text.
posted by me & my monkey at 3:04 PM on January 20, 2009
If you just need the raw text, you could use the OCR functionality of Acrobat Pro, then extract the text.
That is a very good idea. You'll likely lose a lot of formatting and might have the odd misspelled word, but the overall file will be much smaller and you won't have to deal with highlighter text or additional programs.
posted by HonorShadow at 3:25 PM on January 20, 2009
I don't know your platform or if you're willing to write code, but if you are on a Windows box and can work in .NET, then you might try my company's ColorExtractionCommand which is part of a larger document cleanup suite. Given a source image, it will detect if a document contains areas of color and will return those areas so you can pull them out.
posted by plinth at 4:42 PM on January 20, 2009
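For a sense of what "detect areas of color" can mean, here is a generic plain-Python sketch. It is not the ColorExtractionCommand API mentioned above, whose interface is not described in this thread. The function name `colored_bounds` and the `min_spread` threshold are assumptions for illustration: a pixel counts as colored when its channels differ by at least that much, since grays have nearly equal R, G, and B.

```python
def colored_bounds(rows, min_spread=60):
    """Return (left, top, right, bottom) enclosing all colored pixels, or None.

    rows is a list of rows, each a list of (R, G, B) tuples.
    min_spread is an assumed threshold on max(channel) - min(channel).
    """
    xs, ys = [], []
    for y, row in enumerate(rows):
        for x, (r, g, b) in enumerate(row):
            if max(r, g, b) - min(r, g, b) >= min_spread:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # purely black-and-white page
    return (min(xs), min(ys), max(xs), max(ys))

W, Y = (255, 255, 255), (255, 255, 0)
page = [
    [W, W, W],
    [W, Y, Y],
    [W, W, W],
]
print(colored_bounds(page))  # (1, 1, 2, 1)
```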
If you have Photoshop, there are some relatively pain-free ways to do this. You could import the pages into the program by opening the PDFs with it, record an action to convert that range of colors to white (or whatever the page color is), and save. From there you run Photoshop's automation to use that action automatically on all the pages/files. Hope that helps.
posted by HonorShadow at 11:39 AM on January 20, 2009
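The "convert that range of colors to white, then batch it" idea can be sketched without Photoshop. This is a plain-Python illustration on lists of (R, G, B) tuples, not a recorded action. The names `is_highlighter`, `scrub_page`, and `scrub_pages` and both thresholds are assumptions: a pixel is treated as highlighter when it is both fairly saturated and fairly bright, so dark text sitting under the marker is left alone.

```python
import colorsys

def is_highlighter(rgb, min_saturation=0.35, min_value=0.5):
    """Guess whether a pixel is marker ink: saturated and bright (assumed thresholds)."""
    r, g, b = (c / 255 for c in rgb)
    _h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return s >= min_saturation and v >= min_value

def scrub_page(pixels, paper=(255, 255, 255)):
    """Replace every highlighter-like pixel with the paper color."""
    return [paper if is_highlighter(p) else p for p in pixels]

def scrub_pages(pages):
    """Apply the same scrub to every page, like running one action over a batch."""
    return [scrub_page(page) for page in pages]

# One assumed page: yellow marker, purple marker, text under marker, off-white paper.
page = [(255, 255, 0), (200, 120, 220), (40, 40, 10), (250, 245, 235)]
print(scrub_pages([page]))
# [[(255, 255, 255), (255, 255, 255), (40, 40, 10), (250, 245, 235)]]
```

Colored figures or photos in the chapter would also be whitened by a rule like this, so it only suits pages that are meant to be black text on paper.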