Help me Normalize
May 4, 2012 7:27 AM   Subscribe

I need to split a field in a database into lots of fields, in order to normalise the database, but I can't figure out how to do it.

I want to make a one shot (or as few shots as possible) way to split up this field.
I've used excel to stitch together a bunch of XML files and pulled out two fields: RouteName(primary key) and Nodes.
The problem is that Nodes is actually a load of numbers which need to be separate.

REDROUTE / 125 456 88547 9658
BLUEROUTE / 125 554 88754 9857

So it should be RouteName, RouteOrder, Node (with the primary key being RouteName and RouteOrder)

REDROUTE / 0 / 125
REDROUTE / 1 / 456
REDROUTE / 2 / 88547
and so on.

The problem is that there are a lot of these, 7000 ish routes, with up to 116 nodes. My version of excel can't handle the whole lot.
It feels like this would be a pretty common problem and there would be a function built in somewhere to handle it.

The reason it needs to be a simple straightforward method is that I need to write it up and hand it over to less technical people to do possibly quite regularly as part of a longer process.
posted by Just this guy, y'know to Computers & Internet (6 answers total) 2 users marked this as a favorite
Are you using Text to Columns? That is what I would use to get each node into its own column. If it can't handle doing alls the column at once, you'll need to split them up somehow. Try cutting them in half, or at least cutting them all at 58 to see if that is doable.
posted by soelo at 7:39 AM on May 4, 2012

I highly recommend trying Google Refine to clean it up and split it. You may need to dig through some of their how-to videos to get a feel for it, but what you're describing sounds exactly like what it's great for.

After you've split it you can export back to CSV for merging into your database.
posted by odinsdream at 7:52 AM on May 4, 2012 [1 favorite]

Oh, and once you accomplish it with Refine, you can save your work as a script which can be re-executed against other similar datasets.
posted by odinsdream at 7:57 AM on May 4, 2012

This may be too much but: Ever use python? I think it comes with OSX these days. awk and sed would work as well.
import csv

for row in csv.reader(open('myfile.csv','rb')):
    for n in row[1].split(" "):
        print "%s,%d,%s" % (row[0],count,n)
        count += 1
In this example I print out a new CSV file. I could rewrite it so save it directly to disk, or I could just pipe the output to a file. I can expound more, but not sure if this is a too deep down the rabbit hole.
posted by teabag at 8:21 AM on May 4, 2012

ooh, I like google refine.
I think I can find a lot of uses for that. I think it will do the job.

teabag, I'm always trying to find time to better my Python, but everything is always too urgent. Thanks for the suggestion though.
posted by Just this guy, y'know at 8:28 AM on May 4, 2012

You can use jython within google refine. It is an enormously useful tool. Also, join the google refine google group- the people on it are incredibly helpful folks.
posted by rockindata at 11:23 AM on May 4, 2012 [1 favorite]

« Older That stinks!! No.. it SUCKS!   |   Whose mess is it? Newer »
This thread is closed to new comments.