<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: I'll be the most powerful man in Hill Valley, and I'm gonna clean up this data. </title>
	<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data/</link>
	<description>Comments on Ask MetaFilter post I'll be the most powerful man in Hill Valley, and I'm gonna clean up this data.</description>
	<pubDate>Wed, 05 Sep 2012 19:33:42 -0800</pubDate>
	<lastBuildDate>Wed, 05 Sep 2012 19:45:52 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: I&apos;ll be the most powerful man in Hill Valley, and I&apos;m gonna clean up this data.</title>
		<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data</link>	
		<description>Help me find this data analysis tool, so I can process lots of cool data. &lt;br /&gt;&lt;br /&gt; I have this giant set of user-entered data describing places that I&apos;m trying to classify. One specific problem is that people enter very similar but not identical information, so each variant shows up as its own record. So I have records like:&lt;br&gt;
&lt;pre&gt;JOE&apos;S GAS&lt;br&gt;
JOE&apos;S GAS STATION&lt;br&gt;
JOE&apos;S GAS STATION PINEHURST DRIVE&lt;br&gt;
JOE&apos;S GAS STATIONS INC&lt;br&gt;
JOES GAS&lt;br&gt;
JOES GAS STATION&lt;/pre&gt;&lt;br&gt;
&lt;br&gt;
but it would make my life a lot easier if they were all linked together, so I don&apos;t have to classify all of them individually.&lt;br&gt;
&lt;br&gt;
I remember seeing, quite possibly here, a stand-alone Windows program that did this. It was a data analysis package, with a lot of other features and analytical capabilities, but there were robust functions for grouping these sorts of similar texts together using some sort of algorithm (I think I remember fuzzy clustering, but don&apos;t quote me). If memory serves, it was open-source or free or at least there was a free demo, and I seem to remember it being vaguely affiliated with Google. &lt;br&gt;
&lt;br&gt;
I remember there was a modest hubbub when it was released; there was a series of demo videos showing cool features of the program. As I said, I think I may have seen it on the blue, but I follow enough data blogs that I may have seen it elsewhere.</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2012:site.223864</guid>
		<pubDate>Wed, 05 Sep 2012 19:33:42 -0800</pubDate>
		<dc:creator>Homeboy Trouble</dc:creator>
		
			<category>data</category>
		
			<category>analysis</category>
		
			<category>parsing</category>
		
			<category>datacleaning</category>
		
			<category>datamining</category>
		
			<category>nerdy</category>
		
			<category>nerdery</category>
		
			<category>nerdosity</category>
		
			<category>resolved</category>
		
	</item>
	<item>
		<title>By: blahblahblah</title>
		<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data#3237238</link>	
		<description>There are a couple of options that come to mind: &lt;a href=&quot;http://code.google.com/p/google-refine/&quot;&gt;Google Refine&lt;/a&gt; is a great tool for cleaning datasets. &lt;a href=&quot;http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning&quot;&gt;Instructions on using it to clean names can be found at ProPublica&lt;/a&gt;. It runs offline, so you don&apos;t have to upload your data to Google. Also look at &lt;a href=&quot;http://vis.stanford.edu/wrangler/&quot;&gt;Wrangler&lt;/a&gt;, from Stanford, which offers a flexible visual language for data formatting and cleaning.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2012:site.223864-3237238</guid>
		<pubDate>Wed, 05 Sep 2012 19:45:52 -0800</pubDate>
		<dc:creator>blahblahblah</dc:creator>
	</item><item>
		<title>By: pompomtom</title>
		<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data#3237291</link>	
		<description>Seconding Google Refine.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2012:site.223864-3237291</guid>
		<pubDate>Wed, 05 Sep 2012 20:33:39 -0800</pubDate>
		<dc:creator>pompomtom</dc:creator>
	</item><item>
		<title>By: iamkimiam</title>
		<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data#3237394</link>	
		<description>Google Refine, for sure. It&apos;s actually fun, too. I spent 10 hours on it the other day collapsing rows like it was Tetris.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2012:site.223864-3237394</guid>
		<pubDate>Wed, 05 Sep 2012 22:51:59 -0800</pubDate>
		<dc:creator>iamkimiam</dc:creator>
	</item><item>
		<title>By: Homeboy Trouble</title>
		<link>http://ask.metafilter.com/223864/Ill-be-the-most-powerful-man-in-Hill-Valley-and-Im-gonna-clean-up-this-data#3237714</link>	
		<description>Of course, Google Refine! &lt;br&gt;
&lt;br&gt;
Thanks; I knew the hive mind would figure out in twelve minutes what I&apos;d been bashing my brains in over all afternoon.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2012:site.223864-3237714</guid>
		<pubDate>Thu, 06 Sep 2012 08:37:25 -0800</pubDate>
		<dc:creator>Homeboy Trouble</dc:creator>
	</item>
	</channel>
</rss>
