Anyone know of any good data "massaging" tools and or practices?
January 14, 2005 3:07 PM Subscribe
DatabaseFilter: I've been given the task of converting a flat file into a more "grown-up" format. Trouble is, the data in each column differs wildly from record to record (ie: dates have been entered in a zillion different ways). Anyone know of any good data "massaging" tools and or practices?
Response by poster: I have a wide array of OS' at my disposal (Windows 2k3/XP, OS X, Linux and Solaris). The source data is from FileMaker Pro and will be converted for and bulk-loaded into an Oracle-based application (DB and App. Server).
posted by basicchannel at 3:36 PM on January 14, 2005
posted by basicchannel at 3:36 PM on January 14, 2005
Depending on time and what you're familiar with, I'd consider scanning to see what appears to be common formats, write translators for those, and flag anything that doesn't fit. In other words, look at the input and see if it's in, say xx/xx/xx format, and then parse it and see if the days/years/months make sense (numeric,valid year ranges, 31 days hath January, etc.). Lather, rinse, repeat. There's no really good way to do this, and you're probably going to have do a lot of review to make sure you're not translating incorrectly. You could iteratively do the ones that are less common, and the just messed up ones will have to be hand done. If there really was no standard, you will definitely have to do some of this by hand.
Sorry, it's not the magic you sere looking for. :(
posted by RikiTikiTavi at 3:37 PM on January 14, 2005
Sorry, it's not the magic you sere looking for. :(
posted by RikiTikiTavi at 3:37 PM on January 14, 2005
Oh, and a language with regular expressions is assumed, as I can't think of a better way to solve your problem (but if anyone does, by all means speak up!). Perl and Python are popular nowdays. There is an O'Reilly book on regular expressions that I like a lot, highly recommended if you're not too familiar with them.
posted by RikiTikiTavi at 3:39 PM on January 14, 2005
posted by RikiTikiTavi at 3:39 PM on January 14, 2005
How much data is this? If it's less than 100,000 records, Excel and Access actually does a pretty decent job of parsing out dates, if you tell them they're dates. Just pass it though them.
Otherwise -- hmm...I'm betting there's a good Perl module -- check CPAN.
posted by maschnitz at 3:42 PM on January 14, 2005
Otherwise -- hmm...I'm betting there's a good Perl module -- check CPAN.
posted by maschnitz at 3:42 PM on January 14, 2005
Response by poster: oops forgot to add:
As for my "skeelz", I have but a few years of working experience amongst the major DB's. Enough to be a n00b still, but also enough to "figure it out". Sorry if this is vague, it's difficult to articulate quite what you might need to hear.
As for RegEx... yea, I figured that would come into play. I reckoned, however, there might be some fancypants GUI frontend for such a thing for filtering and comparing (as it's not so much the finding of the variations in the data but the comparing and testing that I foresee as being the most time-consuming).
Thanks for the help guys/gals...
posted by basicchannel at 3:44 PM on January 14, 2005
As for my "skeelz", I have but a few years of working experience amongst the major DB's. Enough to be a n00b still, but also enough to "figure it out". Sorry if this is vague, it's difficult to articulate quite what you might need to hear.
As for RegEx... yea, I figured that would come into play. I reckoned, however, there might be some fancypants GUI frontend for such a thing for filtering and comparing (as it's not so much the finding of the variations in the data but the comparing and testing that I foresee as being the most time-consuming).
Thanks for the help guys/gals...
posted by basicchannel at 3:44 PM on January 14, 2005
Response by poster: Oh... it's approx. 6,000 and some change records.
posted by basicchannel at 3:45 PM on January 14, 2005
posted by basicchannel at 3:45 PM on January 14, 2005
I've used the Time::ParseDate perl module to do something like this, it works fairly well.
posted by cmonkey at 3:47 PM on January 14, 2005
posted by cmonkey at 3:47 PM on January 14, 2005
Actually, if you can process it with C, Perl, or php, you can use a function that typically goes by the name of something like 'strtotime' ... it will process a date string such as 14Jan05 or 01/14/05 or 14th of January 2004 into a unix epoch time, from which you can use date('Y-M-d") to put it into the ISO format that databases use as their standard.
Sorry, there isn't really a full procedure, but you can take and do something like
(pseudocode .. can be converted into perl easily)
1. Open the file.
{2. Take an individual line out of the file, in order.
3. Split the line into an array at each comma (assuming comma separated.)
4. At the nth field, which you know to be the date, run date('Y-m-d",strtotime($date));
5. join the array back together into a string with commas between each field, concatenate a newline (\r\n) onto the end.
6. Write the line either to another variable or to another file.
7. End of loop, increment to the next line in the file and go to step 4.}
8. Write everything to a file, close the file handle.
posted by SpecialK at 3:50 PM on January 14, 2005
Sorry, there isn't really a full procedure, but you can take and do something like
(pseudocode .. can be converted into perl easily)
1. Open the file.
{2. Take an individual line out of the file, in order.
3. Split the line into an array at each comma (assuming comma separated.)
4. At the nth field, which you know to be the date, run date('Y-m-d",strtotime($date));
5. join the array back together into a string with commas between each field, concatenate a newline (\r\n) onto the end.
6. Write the line either to another variable or to another file.
7. End of loop, increment to the next line in the file and go to step 4.}
8. Write everything to a file, close the file handle.
posted by SpecialK at 3:50 PM on January 14, 2005
(although is strtotime a perl function? I use it every day in PHP, but I can't remember if it's in Perl or if you have to load a ... Time::ParseDate module. Go cmonkey!)
posted by SpecialK at 3:51 PM on January 14, 2005
posted by SpecialK at 3:51 PM on January 14, 2005
Oops, and I mixed up the quoets in date() ... you want to use all single quotes.
posted by SpecialK at 3:51 PM on January 14, 2005
posted by SpecialK at 3:51 PM on January 14, 2005
Be really careful about trusting any automated method, and backup the data before every translation step because the sanity checks are only as good as the assumptions behind them. If someone has been mixing U.S. and European date formats, for instance, the parser could make some nasty mistakes.
Honestly, when the data is that messy it's worthwhile to step back and ask which data points NEED the effort invested. For instance, start by fixing the primary key field. Then any fields you need to search or sort against. If that doesn't include the date field, and confusion isn't likely to result from someone seeing mismatched date formats, then consider postponing alterations until it does matter.
posted by nakedcodemonkey at 4:02 PM on January 14, 2005
Honestly, when the data is that messy it's worthwhile to step back and ask which data points NEED the effort invested. For instance, start by fixing the primary key field. Then any fields you need to search or sort against. If that doesn't include the date field, and confusion isn't likely to result from someone seeing mismatched date formats, then consider postponing alterations until it does matter.
posted by nakedcodemonkey at 4:02 PM on January 14, 2005
As for RegEx... yea, I figured that would come into play. I reckoned, however, there might be some fancypants GUI frontend for such a thing for filtering and comparing (as it's not so much the finding of the variations in the data but the comparing and testing that I foresee as being the most time-consuming).
Dont give up! There are a lot of tools out there that are basically a GUI in front of a Regexp application. I even used one a few years ago that was very easy to use and could identify and convert a wide variety of date formats. I've been trying to remember it for you but no luck so far - if I do I'll post it.
You might try asking, not here, but at some place like dbforums.com What you are trying to do is a very common problem among data guys and someone in one of those forums could probably steer you in the right direction.
posted by vacapinta at 4:13 PM on January 14, 2005
Dont give up! There are a lot of tools out there that are basically a GUI in front of a Regexp application. I even used one a few years ago that was very easy to use and could identify and convert a wide variety of date formats. I've been trying to remember it for you but no luck so far - if I do I'll post it.
You might try asking, not here, but at some place like dbforums.com What you are trying to do is a very common problem among data guys and someone in one of those forums could probably steer you in the right direction.
posted by vacapinta at 4:13 PM on January 14, 2005
My experience is with Date::Parse (which is wonderful), but I suppose there's more than one way to do it. Whichever you use though, check the returned values for ones that haven't been filled in (like the year) and make sure to have a reasonable default.
posted by fvw at 6:34 PM on January 14, 2005
posted by fvw at 6:34 PM on January 14, 2005
if you're going with an automatic method, the safest approach is to re-use whatever code is currently using the data. presumably some existing program is parsing this file and has been modified over the years to accept a variety of kludges/formats. so grab those routines, rip them from the app, and use the data they parse to generate a uniform format. that way you guarantee not to corrupt the data, even if there are bugs in the existing code (the idea being that you rewrite the app to parse your clean data in a direct way - if you leave the old code in the app then you're open to obscure bugs where data going full circle (parser, writer) doesn't have the same value each cycle).
posted by andrew cooke at 4:25 AM on January 15, 2005
posted by andrew cooke at 4:25 AM on January 15, 2005
Response by poster: THE FMP form app they created doesnt really have any useful parsing of its own, but that was definitely a place I started... so good call andrew.
Thanks again friends! I got a lot of good info out of this and, probably as important, found avenues to explore to make the job better/easier.
posted by basicchannel at 12:03 PM on January 15, 2005
Thanks again friends! I got a lot of good info out of this and, probably as important, found avenues to explore to make the job better/easier.
posted by basicchannel at 12:03 PM on January 15, 2005
I don't think you're going to find anything thats 100% effective, so finding an intern or two to go through each record after you've done as much as you can batchfully (its a word cause I say its a word) will save future unforgiving-superior headache.
posted by softlord at 3:07 PM on January 15, 2005
posted by softlord at 3:07 PM on January 15, 2005
This thread is closed to new comments.
posted by maniactown at 3:34 PM on January 14, 2005