Join 3,553 readers in helping fund MetaFilter (Hide)


Address Parsing 101, please!!
November 3, 2007 3:01 PM   Subscribe

ParseFilter: I have a CSV file full of leads I need to parse into a more, er, concise format. What would the hive mind recommend?

It seems to be quite a bit similar to this thread, except I've already got the data in CSV format. But that doesn't mean it's worth anything to me!

It looks like this: NAME, ADDR1, ADDR2, ADDR3, ADDR4. But it might as well be NAME, ONEBIGLONGSTRINGOFSTUFF. Sometimes city and state are in ADDR3 and sometimes in ADDR4. There might be email addresses or phone or fax numbers mixed in, too.

At first I thought I might just try to geocode each record, but I think there's probably a smarter option. Someone mentioned using sed in the other post, but I can't seem to figure out exactly how to go about doing that. Ruby would be peachy, too!
posted by cdmwebs to Computers & Internet (23 answers total) 2 users marked this as a favorite
 
What do you want to do with it? The other thread wanted to put items on a map; is that what you want to do?

Are all of the addresses inside the United States? If so, http://geocoder.us/ is a good, free resource for after you have your data in some useful (canonical or normalized) form.
posted by cmiller at 3:13 PM on November 3, 2007


Yes, I can see how that might be useful in answering.

Basically, I want to make sure it's a valid postal address for mapping and/or mailings. I know the USPS has an API, but you can't use it commercially, etc...

And really, I just want to learn how to parse things for myself also. I think it's pretty slick when Gmail asks me if I want to map an address, add an event to my calendar or track a package.
posted by cdmwebs at 3:24 PM on November 3, 2007


Not sure I understand what you're really asking, but it might be easiest to just dump it into Excel and then use smart filtering on each column to figure out what each column works out to in each record. That's not as automagic as a combination of sed and awk, but my experience is when the columns are not totally consistent, an automatic processing system becomes way too complicated (i.e. sed really does want columns to be consistent in order to do the kinds of search-replace or conditional things you might want to do to make sense of it.)

I too hate the idea of using Excel for such things but it really does do certain things very well, and this is one of them.
posted by drmarcj at 3:33 PM on November 3, 2007


Excel is a fine way to start out a project like this.

You need to spend some time with the data and understand it better before you can decide whether Excel, Ruby, or a pathologically eclectic rubbish lister is the right way to go.
posted by b1tr0t at 3:58 PM on November 3, 2007


if you truly want to learn how to parse text, perl is your friend.
Picking up "Learning Perl" (also known as the Llama Book) is a great first step, IMHO.

While I'm sure this isn't exactly what you need (since the mention that the data is not 100% structured), it's a place to start...


my $datafile="c:\path\to\your.file"; #path assumes OP is using windows
open(INFILE,$datafile);
my @data = <INFILE>;
close(INFILE);
foreach(@data) {
($name, $addr1, $addr2, $addr3, $addr4) = split(/,/, $_);
}


of course, since the data isn't normalized, you're going to want to learn how to use the match operator =~ and regular expressions to get the data into the correct variables.

good luck.
posted by namewithoutwords at 4:04 PM on November 3, 2007


I guess that's exactly the point. I want to normalize this data. I want to learn how to feed it what will basically be a concatenated string of those four fields an extract the data in a more recognizable format. I'll play with it a bit more. Thanks!
posted by cdmwebs at 4:13 PM on November 3, 2007


Well think about this way. For each concrete address format there will be a certain probably that any individual record will be in that format.

So for example the addresses might be in the format (City-State, Street, #, ZIP) 25% of the time and (City, State, Street-#, ZIP) 25% of the time and (City, State, Street, APT-zip) 18% of the time, and so on.

You don't need to know the exact figures, but if you look at say, 100 of these records (make sure they're random) and come up with parsing rules that will cover like 90 of them, then you'll be able to get most of your data.

But the problem is, right now, you have no idea just how many formats there might be, there could be 3 that cover 90% and 100 more that cover the rest. Or they could be evenly divided into 50 different types. Until you know the number of types of addresses and the coverage of each type (or the major types) you have no idea just how hard this task will be.

Also, PERL sucks. Might as well learn Awk or something. Ruby is the new hotness for these types of tasks.
posted by delmoi at 5:01 PM on November 3, 2007


(disclaimer: I don't know PERL, Awk, or Ruby. I do almost all my programming in Java)
posted by delmoi at 5:02 PM on November 3, 2007


delmoi: "PERL sucks" "I don't know PERL"

In other words, I have no firsthand experience whatsoever with what I speak negatively thereof.

And admonishing the OP that he would be better off using AWK instead of perl is like admonishing a java programmer that they'd be better off using C, or even assembler...
posted by namewithoutwords at 5:36 PM on November 3, 2007


I do this pretty regularly (my condolences) and the solutions are almost always gumshoe solutions rather than IT solutions.

So brt10t was right in that you need to get to know the data.

As I understand, you have all the necessary data, and it's CSV, but there might be extraneous commas leading to fields not matching up.

Is that because null values weren't created? (Meaning if the global format is name,add1,add1,city,state,zip and all you had for one record was the name and city, it's name,city as opposed to name,,,city,,)

Or is it because something crazy happened and you could have all the data with random commas stuck in for fun?

These are hypotheticals that you need to ask and answer and your answers will dictate the solution.

Like, if it's the first option, you'll need to write a parser that tries to figure out what each piece is. Addresses have a lot of clues. If they're all US addresses, even better. You'll have known variables, like state. You can know every version of state, and as such, you can match them. Of course match ",Iowa," to avoid calling "Iowa City" a state. If you know where the state is, it's likely that the city is right before it. Cities don't have numbers, so if there's a number in the field just before what the parser thinks is a state, there's a problem - either there's no city, or you misidentified the state. The zip code is usually right after the state. In this case, it's all numbers (plus maybe a hyphen). Again, test for what you expect. Names don't have numbers. Phone numbers don't have letters.

This describes a very basic parser. If you're parsing 1000 addresses, it'll kick out about 10-20 errors. Those are easy to deal with manually. If you're parsing a million, you'll need more logic.

But we don't know your data, only you do. Get to know your data and what its internal rules are. Then play by those rules.
posted by ochenk at 5:38 PM on November 3, 2007


I never meant to start a language war!

BTW, what's the big difference if all of the languages support regexes anyway? Is a string not a string? Doesn't it all come back to preference?

I guess that's the advice I was looking for - how to get started. I just assumed there was some magical code somewhere that would already do this for me. I sorta kinda figured that it would be, as ochenk put it, a gumshoe solution.

Some more info:
posted by cdmwebs at 6:18 PM on November 3, 2007


"yes, they're all US addresses (that I'm interested in, anyway)"

Unfortunately, you've got to consider what it is, not what you're interested in. If it's got addresses that that don't conform to a standard, you've got to figure out a way to identify those specifically as uninteresting. Otherwise, if you've got 60k address, and a parser chokes on 5k, are you going to be able to find the 500 U.S. addresses within all the non-U.S. errors?

"it's about 60k rows with 1-3 possible addresses per row"

So a single row could have 3 addresses? Yikes. That makes it a lot harder. Not impossible, but harder.

Good luck.
posted by ochenk at 6:43 PM on November 3, 2007


If it's always comma-delimited, and never quoted, then SED or AWK would do. If it's not, then I suggest using a library that handles the four or five variations and methods of quoting comma-as-data inside fields.

May I also suggest Python?
cmiller@zippy:~ $ cat t.csv 
one,two,three
four,five,six
cmiller@zippy:~ $ python
Python 2.5.1 (r251:54863, Oct  5 2007, 13:36:32) 
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> lines = csv.reader(open("t.csv"))
>>> for items in lines:
...    print items
... 
['one', 'two', 'three']
['four', 'five', 'six']
>>> lines = csv.reader(open("t.csv"))
>>> for items in lines:
...    print items[0], items[2]
... 
one three
four six
>>>

posted by cmiller at 6:47 PM on November 3, 2007


Could you post 20 lines or so of the CSV?
posted by signal at 7:13 PM on November 3, 2007


Two things, neither of them very helpful:

1) This Perl won't work:
my $datafile="c:\path\to\your.file"; #path assumes OP is using windows
because the slashes will be clobbered by the interpolation implied by the double-quotes. Use single quotes or forward slashes, and if it's CSV, splitting on plain commas won't work either, you should use a CSV module.

2) The data we're talking about, from the sound of it, is pretty much irretrievably broken, and no one program in any language will be able to fix it. You're going to need a combination of things, the main one of which is human problem-solving.

What you could do with a program is divide the data up into good, parseable ones, and bad, a-human-brain-is-needed ones.

You could do that as simply as parsing for the right number of fields, numbers in the zip code field and so on.

If you can tell us what a "good" record is like, I'm sure we can grab all the good ones quite simply, and see how many of them there are.
posted by AmbroseChapel at 7:31 PM on November 3, 2007


Okay, here you go: http://cdmwebs.com/files/40Lines.txt.

FYI, it's using pipes as delimiters and tildes as text wrappers, since the data inside can be a mess...

Thanks!
posted by cdmwebs at 8:09 PM on November 3, 2007


Wow, that's a mess. As best as I can tell, those pipes aren't delimiters. Those are just artifacts. For example, there's no reason for:

UNIT~|~ED STATES

to exist. Or

MIAM~|~I 33166

Or

5507 S. HOWELL~|~AVENUE,MILWAUKEE,WI

There are plenty of examples of pipes where they should be, but the error rate is pretty high. Plus, there's a very high percentage of international addresses.

Is there's something special about this dataset? If not, get a better data set.

If this dataset is the only available set, see if you can get previous versions.

If not, get rid of all the delimiters. Match countries, kick out anything that you're sure isn't US. Split the file by a known. (I'd go for zip.) See how it splits and if you need to append offsets from previous/subsequent records. Then pull out phones. Match state and city by lookup tables. You're left with names and street addresses. Try to spit by first number and see how big your error rate is.

It's not going to be pretty, but you should be able to start at about 50% success.

Or just outsource it.
posted by ochenk at 10:55 PM on November 3, 2007


Wow, really pretty broken.

Some of those lines look like payment details rather than addresses -- "to the order of"?

OK after a quick parse: It looks like the result of a really bad OCR scanning process.

If you paid for this data, then just ask for your money back. If you didn't then you need to decide how useful it can be to you because, to get it really clean, you're either going to have to fix it up yourself or pay someone else to do it.

If you want it split up so it's just easier to read, no problem. A quick Perl script will do that, but you're still going to need human intelligence to actually interpret it. Too many variables to code around.
posted by AmbroseChapel at 1:16 AM on November 4, 2007


Have a data-entry person translate it into Excel.
posted by rhizome at 3:23 AM on November 4, 2007


That data is pretty awful. It's not "csv", as you claimed; it's some other format. Some quick analysis and guesses:

It looks more like a rather stupid dump of a database, some place where there a difference between no-value and zero-length-string.

Tilde and pipe appear to be special characters. (Duh.) So, those can't appear in text, presumably.

Each line has 14 pipes in it, so you can probably split them into 15 groups that way first.

Then, if the group is empty, then it's no-value or NULL. Else, strip off the beginning and end the tilde, and the remainder is the value.

Hope that helps. I note that we're not yet even close to your question of what you want to do in the end. :\
posted by cmiller at 8:18 AM on November 4, 2007


Thanks for the input.

I kinda figured this would be the outcome. The data is actually aggregated from bills of lading. It's a list of everything that came into the country through the Port of Charleston, SC for about six months.

My first thought was to build sort of a validator that would Google each set of addresses as a single string of each four fields. Anything to help weed out the easier ones!

I'm going to take the sample down now. If someone else has a better solution, MeFi Mail me and I'll send the file.
posted by cdmwebs at 2:17 PM on November 4, 2007


Might consider looking into Amazon's Mechanical Turk.
posted by delmoi at 8:06 PM on November 11, 2007


@delmoi - Wow, that's interesting!
posted by cdmwebs at 3:17 PM on November 12, 2007


« Older Would you pay someone to take ...   |  I need an unlimited online bac... Newer »
This thread is closed to new comments.