Address Parsing 101, please!!
November 3, 2007 3:01 PM
ParseFilter: I have a CSV file full of leads I need to parse into a more, er, concise format. What would the hive mind recommend?
It seems to be quite a bit similar to this thread, except I've already got the data in CSV format. But that doesn't mean it's worth anything to me!
It looks like this: NAME, ADDR1, ADDR2, ADDR3, ADDR4. But it might as well be NAME, ONEBIGLONGSTRINGOFSTUFF. Sometimes city and state are in ADDR3 and sometimes in ADDR4. There might be email addresses or phone or fax numbers mixed in, too.
At first I thought I might just try to geocode each record, but I think there's probably a smarter option. Someone mentioned using sed in the other post, but I can't seem to figure out exactly how to go about doing that. Ruby would be peachy, too!
Response by poster: Yes, I can see how that might be useful in answering.
Basically, I want to make sure it's a valid postal address for mapping and/or mailings. I know the USPS has an API, but you can't use it commercially, etc...
And really, I just want to learn how to parse things for myself also. I think it's pretty slick when Gmail asks me if I want to map an address, add an event to my calendar or track a package.
posted by cdmwebs at 3:24 PM on November 3, 2007
Not sure I understand what you're really asking, but it might be easiest to just dump it into Excel and then use smart filtering on each column to figure out what each column works out to in each record. That's not as automagic as a combination of sed and awk, but my experience is when the columns are not totally consistent, an automatic processing system becomes way too complicated (i.e. sed really does want columns to be consistent in order to do the kinds of search-replace or conditional things you might want to do to make sense of it.)
I too hate the idea of using Excel for such things but it really does do certain things very well, and this is one of them.
posted by drmarcj at 3:33 PM on November 3, 2007
if you truly want to learn how to parse text, perl is your friend.
Picking up "Learning Perl" (also known as the Llama Book) is a great first step, IMHO.
While I'm sure this isn't exactly what you need (since you mention that the data is not 100% structured), it's a place to start...
my $datafile="c:\path\to\your.file"; #path assumes OP is using windows
open(INFILE,$datafile);
my @data = <INFILE>;
close(INFILE);
foreach(@data) {
($name, $addr1, $addr2, $addr3, $addr4) = split(/,/, $_);
}
of course, since the data isn't normalized, you're going to want to learn how to use the match operator =~ and regular expressions to get the data into the correct variables.
good luck.
posted by namewithoutwords at 4:04 PM on November 3, 2007
Response by poster: I guess that's exactly the point. I want to normalize this data. I want to learn how to feed it what will basically be a concatenated string of those four fields and extract the data in a more recognizable format. I'll play with it a bit more. Thanks!
posted by cdmwebs at 4:13 PM on November 3, 2007
Best answer: Well, think about it this way. For each concrete address format there will be a certain probability that any individual record will be in that format.
So for example the addresses might be in the format (City-State, Street, #, ZIP) 25% of the time and (City, State, Street-#, ZIP) 25% of the time and (City, State, Street, APT-zip) 18% of the time, and so on.
You don't need to know the exact figures, but if you look at say, 100 of these records (make sure they're random) and come up with parsing rules that will cover like 90 of them, then you'll be able to get most of your data.
But the problem is that right now you have no idea how many formats there might be. There could be 3 that cover 90% and 100 more that cover the rest, or they could be evenly divided into 50 different types. Until you know the number of types of addresses and the coverage of each type (or the major types), you have no idea just how hard this task will be.
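A minimal sketch of that sampling idea in Python 3, not anything posted in the thread. The rule names and regexes in FORMAT_RULES are made up for illustration, and it assumes each record has already been joined into one string. The point is only to show how you might measure which share of a random sample each candidate rule covers.

```python
import random
import re
from collections import Counter

# Hypothetical format rules, tried in order against the joined address text.
# These two patterns are illustrative only; real data will need more.
FORMAT_RULES = {
    "city_state_zip": re.compile(r"[A-Za-z .]+,\s*[A-Z]{2}\s+\d{5}(-\d{4})?$"),
    "zip_only_tail": re.compile(r"\b\d{5}(-\d{4})?$"),
}


def classify(record):
    """Return the name of the first rule that matches, or 'unknown'."""
    for name, pattern in FORMAT_RULES.items():
        if pattern.search(record):
            return name
    return "unknown"


def estimate_coverage(records, sample_size=100, seed=0):
    """Classify a random sample and return the fraction seen per format."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    counts = Counter(classify(r) for r in sample)
    return {name: n / len(sample) for name, n in counts.items()}
```

If "unknown" comes back as a large share of the sample, that is the signal that there are more formats left to find before the rules are worth running on the full file.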
Also, PERL sucks. Might as well learn Awk or something. Ruby is the new hotness for these types of tasks.
posted by delmoi at 5:01 PM on November 3, 2007
(disclaimer: I don't know PERL, Awk, or Ruby. I do almost all my programming in Java)
posted by delmoi at 5:02 PM on November 3, 2007
delmoi: "PERL sucks" "I don't know PERL"
In other words, I have no firsthand experience whatsoever with what I speak negatively thereof.
And admonishing the OP that he would be better off using AWK instead of perl is like admonishing a java programmer that they'd be better off using C, or even assembler...
posted by namewithoutwords at 5:36 PM on November 3, 2007
Best answer: I do this pretty regularly (my condolences) and the solutions are almost always gumshoe solutions rather than IT solutions.
So brt10t was right in that you need to get to know the data.
As I understand, you have all the necessary data, and it's CSV, but there might be extraneous commas leading to fields not matching up.
Is that because null values weren't created? (Meaning if the global format is name,add1,add2,city,state,zip and all you had for one record was the name and city, it's name,city as opposed to name,,,city,,)
Or is it because something crazy happened and you could have all the data with random commas stuck in for fun?
These are hypotheticals that you need to ask and answer and your answers will dictate the solution.
Like, if it's the first option, you'll need to write a parser that tries to figure out what each piece is. Addresses have a lot of clues. If they're all US addresses, even better. You'll have known variables, like state. You can know every version of state, and as such, you can match them. Of course match ",Iowa," to avoid calling "Iowa City" a state. If you know where the state is, it's likely that the city is right before it. Cities don't have numbers, so if there's a number in the field just before what the parser thinks is a state, there's a problem - either there's no city, or you misidentified the state. The zip code is usually right after the state. In this case, it's all numbers (plus maybe a hyphen). Again, test for what you expect. Names don't have numbers. Phone numbers don't have letters.
This describes a very basic parser. If you're parsing 1000 addresses, it'll kick out about 10-20 errors. Those are easy to deal with manually. If you're parsing a million, you'll need more logic.
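For what it's worth, the clue-based checks above could be sketched roughly like this in Python 3. This is an illustration under assumptions, not a tested parser: STATES is a deliberately tiny stand-in for a full lookup table, the regexes are guesses at what ZIPs and phone numbers look like in the data, and label_field and label_record are hypothetical names. Matching a whole field against the state table is the equivalent of matching ",Iowa," so that "Iowa City" is not called a state.

```python
import re

# Abbreviated lookup table for illustration; a real one would list every
# state, territory, and spelling variant.
STATES = {"IA", "IOWA", "OH", "OHIO", "SC", "SOUTH CAROLINA"}
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")     # all digits, maybe a hyphen
PHONE_RE = re.compile(r"^[\d\s().+-]{7,}$")  # no letters allowed


def label_field(field):
    """Guess what a single field holds based on its contents alone."""
    text = field.strip()
    if not text:
        return "empty"
    if text.upper() in STATES:  # whole-field match only
        return "state"
    if ZIP_RE.match(text):
        return "zip"
    if PHONE_RE.match(text) and sum(c.isdigit() for c in text) >= 7:
        return "phone"
    return "other"


def label_record(fields):
    """Label every field, then use position: the city sits before the state."""
    labels = [label_field(f) for f in fields]
    problems = []
    for i, label in enumerate(labels):
        if label != "state":
            continue
        before = fields[i - 1] if i > 0 else ""
        if i == 0 or labels[i - 1] != "other" or any(c.isdigit() for c in before):
            # Either there is no city, or the state was misidentified.
            problems.append("no plausible city before state")
        else:
            labels[i - 1] = "city"
    return labels, problems
```

Records that come back with a non-empty problems list are the ones to kick out for manual review.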
But we don't know your data, only you do. Get to know your data and what its internal rules are. Then play by those rules.
posted by ochenk at 5:38 PM on November 3, 2007
Response by poster: I never meant to start a language war!
BTW, what's the big difference if all of the languages support regexes anyway? Is a string not a string? Doesn't it all come back to preference?
I guess that's the advice I was looking for - how to get started. I just assumed there was some magical code somewhere that would already do this for me. I sorta kinda figured that it would be, as ochenk put it, a gumshoe solution.
Some more info:
- yes, they're all US addresses (that I'm interested in, anyway)
- it's about 60k rows with 1-3 possible addresses per row
posted by cdmwebs at 6:18 PM on November 3, 2007
"yes, they're all US addresses (that I'm interested in, anyway)"
Unfortunately, you've got to consider what it is, not what you're interested in. If it's got addresses that don't conform to a standard, you've got to figure out a way to identify those specifically as uninteresting. Otherwise, if you've got 60k addresses, and a parser chokes on 5k, are you going to be able to find the 500 U.S. addresses within all the non-U.S. errors?
"it's about 60k rows with 1-3 possible addresses per row"
So a single row could have 3 addresses? Yikes. That makes it a lot harder. Not impossible, but harder.
Good luck.
posted by ochenk at 6:43 PM on November 3, 2007
If it's always comma-delimited, and never quoted, then SED or AWK would do. If it's not, then I suggest using a library that handles the four or five variations and methods of quoting comma-as-data inside fields.
May I also suggest Python?
cmiller@zippy:~ $ cat t.csv
one,two,three
four,five,six
cmiller@zippy:~ $ python
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> lines = csv.reader(open("t.csv"))
>>> for items in lines:
...     print items
...
['one', 'two', 'three']
['four', 'five', 'six']
>>> lines = csv.reader(open("t.csv"))
>>> for items in lines:
...     print items[0], items[2]
...
one three
four six
>>>
posted by cmiller at 6:47 PM on November 3, 2007
Two things, neither of them very helpful:
1) This Perl won't work:
my $datafile="c:\path\to\your.file"; #path assumes OP is using windows
because the backslashes will be clobbered by the interpolation implied by the double quotes. Use single quotes or forward slashes. Also, if it's CSV, splitting on plain commas won't work either; you should use a CSV module.
2) The data we're talking about, from the sound of it, is pretty much irretrievably broken, and no one program in any language will be able to fix it. You're going to need a combination of things, the main one of which is human problem-solving.
What you could do with a program is divide the data up into good, parseable ones, and bad, a-human-brain-is-needed ones.
You could do that as simply as parsing for the right number of fields, numbers in the zip code field and so on.
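A rough Python 3 sketch of that good/bad split. The definition of "good" here is purely an assumption, since the thread never settles one: exactly five fields (NAME plus ADDR1 to ADDR4, as in the question) and something ZIP-shaped at the end of the last non-empty field. The names is_parseable and triage are hypothetical.

```python
import re

EXPECTED_FIELDS = 5  # NAME, ADDR1..ADDR4 as described in the question
ZIP_AT_END = re.compile(r"\b\d{5}(-\d{4})?\s*$")


def is_parseable(fields):
    """Assumed test for a 'good' record: right field count, ZIP at the end."""
    if len(fields) != EXPECTED_FIELDS:
        return False
    non_empty = [f.strip() for f in fields if f.strip()]
    return bool(non_empty) and bool(ZIP_AT_END.search(non_empty[-1]))


def triage(rows):
    """Split rows into machine-parseable ones and ones needing a human."""
    good, bad = [], []
    for fields in rows:
        (good if is_parseable(fields) else bad).append(fields)
    return good, bad
```

Counting len(good) against len(bad) on the real file gives a quick read on how much manual work is left.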
If you can tell us what a "good" record is like, I'm sure we can grab all the good ones quite simply, and see how many of them there are.
posted by AmbroseChapel at 7:31 PM on November 3, 2007
Response by poster: Okay, here you go: http://cdmwebs.com/files/40Lines.txt.
FYI, it's using pipes as delimiters and tildes as text wrappers, since the data inside can be a mess...
Thanks!
posted by cdmwebs at 8:09 PM on November 3, 2007
Best answer: Wow, that's a mess. As best as I can tell, those pipes aren't delimiters. Those are just artifacts. For example, there's no reason for:
UNIT~|~ED STATES
to exist. Or
MIAM~|~I 33166
Or
5507 S. HOWELL~|~AVENUE,MILWAUKEE,WI
There are plenty of examples of pipes where they should be, but the error rate is pretty high. Plus, there's a very high percentage of international addresses.
Is there something special about this dataset? If not, get a better data set.
If this dataset is the only available set, see if you can get previous versions.
If not, get rid of all the delimiters. Match countries, kick out anything that you're sure isn't US. Split the file by a known. (I'd go for zip.) See how it splits and if you need to append offsets from previous/subsequent records. Then pull out phones. Match state and city by lookup tables. You're left with names and street addresses. Try to split by first number and see how big your error rate is.
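The first few of those steps might look something like this in Python 3. It is only a sketch: the regexes are assumptions about what the artifacts, ZIP codes, and US phone numbers look like, and clean_line and extract are hypothetical names. Note the trade-off in deleting the artifacts outright: it repairs "MIAM~|~I 33166" into "MIAMI 33166", but it also glues "HOWELL~|~AVENUE" into "HOWELLAVENUE", which is part of why the error rate has to be measured rather than assumed.

```python
import re

ARTIFACT_RE = re.compile(r"~?\|+~?")  # "~|~", "~||~", and similar
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def clean_line(line):
    """Delete pipe/tilde artifacts and any wrapper tildes at the ends."""
    return ARTIFACT_RE.sub("", line).strip("~ \r\n")


def extract(line):
    """Pull out phone numbers first, then take the last ZIP-shaped token."""
    text = clean_line(line)
    phones = PHONE_RE.findall(text)
    remainder = PHONE_RE.sub(" ", text)
    zips = ZIP_RE.findall(remainder)
    return {"text": text, "phones": phones, "zip": zips[-1] if zips else None}
```

Taking the last ZIP-shaped token is a guess meant to dodge five-digit street numbers, and it will be wrong for rows holding more than one address.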
It's not going to be pretty, but you should be able to start at about 50% success.
Or just outsource it.
posted by ochenk at 10:55 PM on November 3, 2007
Wow, really pretty broken.
Some of those lines look like payment details rather than addresses -- "to the order of"?
OK after a quick parse:
- four of the lines are empty
- the non-empty ones all contain at least two addresses, sometimes three, and the later addresses are all overseas
- sometimes the multiple addresses split by "~||~" but mostly they don't
- the fields, what they contain, and whether the records even contain things like country, zip, state, are pretty much completely random
If you paid for this data, then just ask for your money back. If you didn't then you need to decide how useful it can be to you because, to get it really clean, you're either going to have to fix it up yourself or pay someone else to do it.
If you want it split up so it's just easier to read, no problem. A quick Perl script will do that, but you're still going to need human intelligence to actually interpret it. Too many variables to code around.
posted by AmbroseChapel at 1:16 AM on November 4, 2007
Best answer: Have a data-entry person translate it into Excel.
posted by rhizome at 3:23 AM on November 4, 2007
That data is pretty awful. It's not "csv", as you claimed; it's some other format. Some quick analysis and guesses:
It looks more like a rather stupid dump of a database, some place where there is a difference between no-value and zero-length-string.
Tilde and pipe appear to be special characters. (Duh.) So, those can't appear in text, presumably.
Each line has 14 pipes in it, so you can probably split them into 15 groups that way first.
Then, if the group is empty, then it's no-value or NULL. Else, strip off the tilde at the beginning and end, and the remainder is the value.
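In Python 3 that reading of the format could be sketched as below. It takes the guesses above at face value (14 pipes per line, tildes only ever used as wrappers), which the sample may well violate; parse_line is a hypothetical name. An empty group becomes None, while a bare pair of tildes becomes a zero-length string, so the two cases stay distinct.

```python
def parse_line(line, expected_fields=15):
    """Split one dump line on pipes and unwrap the tilde-quoted values."""
    groups = line.rstrip("\r\n").split("|")
    if len(groups) != expected_fields:
        raise ValueError(
            "expected %d fields, got %d" % (expected_fields, len(groups))
        )
    values = []
    for group in groups:
        if group == "":
            values.append(None)  # no-value / NULL
        elif len(group) >= 2 and group[0] == "~" and group[-1] == "~":
            values.append(group[1:-1])  # "~~" becomes a zero-length string
        else:
            values.append(group)  # unwrapped text, kept as-is
    return values
```

Lines that raise ValueError are the ones where the 14-pipe assumption fails, and those are worth setting aside to look at by hand.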
Hope that helps. I note that we're not yet even close to your question of what you want to do in the end. :\
posted by cmiller at 8:18 AM on November 4, 2007
Response by poster: Thanks for the input.
I kinda figured this would be the outcome. The data is actually aggregated from bills of lading. It's a list of everything that came into the country through the Port of Charleston, SC for about six months.
My first thought was to build a sort of validator that would Google each address as a single string made up of all four fields. Anything to help weed out the easier ones!
I'm going to take the sample down now. If someone else has a better solution, MeFi Mail me and I'll send the file.
posted by cdmwebs at 2:17 PM on November 4, 2007
Best answer: Might consider looking into Amazon's Mechanical Turk.
posted by delmoi at 8:06 PM on November 11, 2007
Response by poster: @delmoi - Wow, that's interesting!
posted by cdmwebs at 3:17 PM on November 12, 2007
This thread is closed to new comments.
Are all of the addresses inside the United States? If so, http://geocoder.us/ is a good, free resource for after you have your data in some useful (canonical or normalized) form.
posted by cmiller at 3:13 PM on November 3, 2007