

I'll be the most powerful man in Hill Valley, and I'm gonna clean up this data.
September 5, 2012 7:33 PM

Help me find this data analysis tool, so I can process lots of cool data.

I have this giant set of user-entered data describing places that I'm trying to classify. One specific problem is that people enter very similar but not identical information, so each variant shows up as its own record. So I have records like:
JOE'S GAS
JOE'S GAS STATION
JOE'S GAS STATION PINEHURST DRIVE
JOE'S GAS STATIONS INC
JOES GAS
JOES GAS STATION
but it would make my life a lot easier if they were all linked together, so I don't have to classify all of them individually.

I remember seeing, quite possibly here, a stand-alone Windows program that did this. It was a data analysis package, with a lot of other features and analytical capabilities, but there were robust functions for grouping these sorts of similar texts together using some sort of algorithm (I think I remember fuzzy clustering, but don't quote me). If memory serves, it was open-source or free or at least there was a free demo, and I seem to remember it being vaguely affiliated with Google.

I remember there was a modest hubbub when it was released; there was a series of demo videos showing cool features of the program. As I said, I think I may have seen it on the blue, but I follow enough data blogs that I may have seen it elsewhere.
posted by Homeboy Trouble to Computers & Internet (4 answers total) 21 users marked this as a favorite
 
There are a couple of options that come to mind: Google Refine is a great tool for cleaning datasets. Instructions on using it to clean names can be found at ProPublica. It runs offline, so you don’t have to upload your data to Google. Also look at Wrangler, from Stanford, which offers a flexible visual language for data formatting and cleaning.
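
To give a rough idea of what this kind of name clustering does, here is a minimal Python sketch of a "fingerprint" key-collision approach: normalize each name into a key, then group names that share a key. It is only loosely modeled on the idea behind Refine's clustering and is not its actual implementation; the function names `fingerprint` and `cluster` are made up for illustration, and the sample names are the ones from the question.

```python
import string
from collections import defaultdict


def fingerprint(name):
    """Build a normalized key: lowercase, no punctuation, unique tokens sorted."""
    cleaned = name.strip().lower()
    # Drop ASCII punctuation, so "JOE'S" and "JOES" normalize to the same token.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sorting the unique tokens makes the key insensitive to word order.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(names):
    """Group names whose fingerprints collide (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


names = [
    "JOE'S GAS",
    "JOE'S GAS STATION",
    "JOE'S GAS STATION PINEHURST DRIVE",
    "JOE'S GAS STATIONS INC",
    "JOES GAS",
    "JOES GAS STATION",
]

clusters = cluster(names)
for key, members in clusters.items():
    print(key, "->", members)
```

On this sample, the sketch only merges the apostrophe variants ("JOE'S GAS" with "JOES GAS", and "JOE'S GAS STATION" with "JOES GAS STATION"), leaving four groups. Linking "JOE'S GAS" to "JOE'S GAS STATION" would need a looser similarity measure, such as an edit-distance-based one, at the cost of more false matches to review.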
posted by blahblahblah at 7:45 PM on September 5, 2012 [4 favorites]


2nd google refine.
posted by pompomtom at 8:33 PM on September 5, 2012


Google Refine, for sure. It's actually fun, too. I spent 10 hours on it the other day collapsing rows like it was Tetris.
posted by iamkimiam at 10:51 PM on September 5, 2012


Of course, Google Refine!

Thanks; I knew the hive mind would figure out in twelve minutes what I'd been bashing my brains in over all afternoon.
posted by Homeboy Trouble at 8:37 AM on September 6, 2012

