How to identify blank images?
April 27, 2006 1:48 PM

How can I quickly identify blank JPG images from a set of thousands?

I've recently converted tens of thousands of WMF files into JPG images. I've discovered that some (<1%) of the conversions failed, resulting in blank (100% white or 100% black) JPGs. I'd like to identify these and re-convert or trash them.

The images are all different dimensions, so I can't just sort by size to find the blanks. It would take forever to scroll through all the thumbnails to find them, and there's no guarantee I wouldn't miss a few. Is there a free or cheap application out there that would automagically identify those blank images? I've been looking for an image browser that sorts by hue, but haven't found one yet. I've also looked for a tool that could compare images against a control image, but they all require the images to be the same size. I could use a Mac or Windows box for the task.
posted by danblaker to Computers & Internet (11 answers total)
 
In theory, those images should be really, really small - single colors compress very, very well, be it GIF or JPEG.
posted by jedrek at 1:53 PM on April 27, 2006


Best answer: Do you know your largest image dimension? Make an all-black or all-white JPEG of those dimensions and save it, so that you get a sense for the upper limit of filesize for these duds. Then sort by size and scroll through your thumbnails up to that general filesize.
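
A minimal sketch of that size-based triage, using only the Python standard library. The function name smallest_files, the folder name, and the cutoff are placeholders, not anything from this thread; the cutoff would come from the test JPEG described above.

```python
import os


def smallest_files(folder, max_bytes, extensions=(".jpg", ".jpeg")):
    """Return (size, name) pairs for image files no larger than max_bytes, smallest first."""
    hits = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subfolders
        if not name.lower().endswith(extensions):
            continue  # skip anything that is not named like a JPEG
        size = os.path.getsize(path)
        if size <= max_bytes:
            hits.append((size, name))
    return sorted(hits)


# Example with a hypothetical folder and a hypothetical 20 KB cutoff:
# for size, name in smallest_files("images", 20 * 1024):
#     print(size, name)
```

This only narrows the list; anything it prints still needs a look, since a small non-blank image can be just as tiny on disk.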
posted by misterbrandt at 2:26 PM on April 27, 2006


I had a flash card issue that was solved by a process that may also be useful to you. This method is dependent on JPG header information and patience.
posted by whatzit at 2:40 PM on April 27, 2006


You could try a query by example with imgSeek. It does content-based image queries. Last I checked, it wasn't great at much, but one thing I think it would be great at is finding other images that are mostly white or mostly black.
posted by Good Brain at 2:54 PM on April 27, 2006


You could write a program that goes through and generates histograms on the colors used for each picture and deletes the file if there is only 1 color. In Python using the PIL library it would be as simple as:

for filename in listoffiles:
    h1 = Image.open(filename).histogram()
    if len(h1) < 1: os.remove(filename)

You might need to do something fancier than just checking the length of the histogram, which does not depend on how many colors the image uses.
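
A minimal sketch of one such fancier check, assuming Pillow (the maintained PIL fork) is installed. histogram() returns 256 pixel counts per band, so this converts the image to grayscale and counts how many of those bins are non-empty. The names used_levels and is_single_color are made up for illustration.

```python
def used_levels(histogram):
    """Count the histogram bins that contain at least one pixel."""
    return sum(1 for count in histogram if count > 0)


def is_single_color(path):
    """Return True if the image at `path` decodes to a single gray level."""
    # Imported here so used_levels() can be tried without Pillow installed.
    from PIL import Image

    with Image.open(path) as img:
        # Mode "L" is 8-bit grayscale, so the histogram has exactly 256 bins.
        return used_levels(img.convert("L").histogram()) == 1
```

Nothing is deleted here; the function only reports. Something like `if is_single_color(name): print(name)` over the file list would give a list to review first.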
posted by gus at 2:57 PM on April 27, 2006


iView MediaPro can find duplicate images in a collection. They've got a 21-day demo for Mac or Windows. That would, in theory, leave one black and one white image in the set, which you coudl presumably use filesizes or thumbnails to eliminate.

Google also turned this thing up for Windows.
posted by chazlarson at 2:58 PM on April 27, 2006


s/coudl/could/
posted by chazlarson at 2:58 PM on April 27, 2006


Oh, wait. Never mind. I totally glossed over the size issue. I think iView may still work for you, though. I searched for dupes in a big, um, swimsuit collection, and it flagged originals and their thumbnails as dupes, even though they had different dimensions. I don't know if it would flag, say, a 100x150 all-white image as a duplicate of a 230x112 all-white image.
posted by chazlarson at 3:02 PM on April 27, 2006


Response by poster: gus' suggestion may be just the thing.

I do happen to have Python installed, but I don't know the first thing about writing a script. (Well, I do know the first thing--launch IDLE--but not much after that.)

Once I have PIL installed, I assume I'd need to use "import PIL" at the beginning of the script. What is "listoffiles" in that example, and if it's a txt file where does it go?
posted by danblaker at 3:30 PM on April 27, 2006


listoffiles is just a file and isn't specified in gus' script. So here's my untested pidgin Python version of the same thing that attempts to be runnable:



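import os
from PIL import Image

listoffiles = os.listdir("images")

for filename in listoffiles:
    path = os.path.join("images", filename)  # listdir() returns bare file names
    # histogram() always has 256 bins per band, so count the non-empty bins
    # instead of checking len(); a solid image uses exactly one gray level
    h1 = Image.open(path).convert("L").histogram()
    if sum(1 for count in h1 if count) == 1: os.remove(path)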


(gus' script had a formatting error, which I also corrected.)

This assumes your photos are in a flat directory called "images" within the directory of the script, and that there are no other types of file in there.
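
If those assumptions don't hold, here is a rough sketch that skips anything not named like a JPEG, ignores files Pillow can't read, and moves suspects into a separate folder instead of deleting them. It assumes Pillow is installed. The names looks_blank and quarantine_blanks, the destination folder, and the tolerance parameter are all hypothetical; the tolerance is only there in case a "blank" turns out to be nearly uniform rather than exactly uniform.

```python
import os
import shutil

JPEG_EXTENSIONS = (".jpg", ".jpeg")


def looks_blank(extrema, tolerance=0):
    """extrema is the (darkest, lightest) pair of a grayscale image."""
    darkest, lightest = extrema
    return lightest - darkest <= tolerance


def quarantine_blanks(folder, blank_folder, tolerance=0):
    """Move blank-looking JPEGs from folder into blank_folder; return their names."""
    # Imported here so looks_blank() can be tried without Pillow installed.
    from PIL import Image

    os.makedirs(blank_folder, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or not name.lower().endswith(JPEG_EXTENSIONS):
            continue  # subfolder or some other type of file
        try:
            with Image.open(path) as img:
                # For mode "L", getextrema() returns a (min, max) pair.
                extrema = img.convert("L").getextrema()
        except OSError:
            continue  # unreadable or not really an image; leave it alone
        if looks_blank(extrema, tolerance):
            shutil.move(path, os.path.join(blank_folder, name))
            moved.append(name)
    return moved
```

Calling `quarantine_blanks("images", "blank")` would leave the suspects in "blank" for a quick look before re-converting or trashing them.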
posted by abcde at 7:25 PM on April 27, 2006


Er, "...is just a file" - is just a variable, rather. What it isn't is a file as you suggested. Not an especially easy statement to mess up, but I managed.
posted by abcde at 7:30 PM on April 27, 2006

