Help me Normalize
May 4, 2012 7:27 AM Subscribe
I need to split a field in a database into lots of fields, in order to normalise the database, but I can't figure out how to do it.
I want to make a one shot (or as few shots as possible) way to split up this field.
I've used excel to stitch together a bunch of XML files and pulled out two fields: RouteName(primary key) and Nodes.
The problem is that Nodes is actually a load of numbers which need to be separate.
Example:
REDROUTE / 125 456 88547 9658
BLUEROUTE / 125 554 88754 9857
So it should be RouteName, RouteOrder, Node (with the primary key being RouteName and RouteOrder)
Example:
REDROUTE / 0 / 125
REDROUTE / 1 / 456
REDROUTE / 2 / 88547
and so on.
The problem is that there are a lot of these, 7000 ish routes, with up to 116 nodes. My version of excel can't handle the whole lot.
It feels like this would be a pretty common problem and there would be a function built in somewhere to handle it.
The reason it needs to be a simple straightforward method is that I need to write it up and hand it over to less technical people to do possibly quite regularly as part of a longer process.
I want to make a one shot (or as few shots as possible) way to split up this field.
I've used excel to stitch together a bunch of XML files and pulled out two fields: RouteName(primary key) and Nodes.
The problem is that Nodes is actually a load of numbers which need to be separate.
Example:
REDROUTE / 125 456 88547 9658
BLUEROUTE / 125 554 88754 9857
So it should be RouteName, RouteOrder, Node (with the primary key being RouteName and RouteOrder)
Example:
REDROUTE / 0 / 125
REDROUTE / 1 / 456
REDROUTE / 2 / 88547
and so on.
The problem is that there are a lot of these, 7000 ish routes, with up to 116 nodes. My version of excel can't handle the whole lot.
It feels like this would be a pretty common problem and there would be a function built in somewhere to handle it.
The reason it needs to be a simple straightforward method is that I need to write it up and hand it over to less technical people to do possibly quite regularly as part of a longer process.
This may be too much but: Ever use python? I think it comes with OSX these days. awk and sed would work as well.
posted by teabag at 8:21 AM on May 4, 2012
import csv for row in csv.reader(open('myfile.csv','rb')): count=0 for n in row[1].split(" "): print "%s,%d,%s" % (row[0],count,n) count += 1In this example I print out a new CSV file. I could rewrite it so save it directly to disk, or I could just pipe the output to a file. I can expound more, but not sure if this is a too deep down the rabbit hole.
posted by teabag at 8:21 AM on May 4, 2012
Response by poster: ooh, I like google refine.
I think I can find a lot of uses for that. I think it will do the job.
teabag, I'm always trying to find time to better my Python, but everything is always too urgent. Thanks for the suggestion though.
posted by Just this guy, y'know at 8:28 AM on May 4, 2012
I think I can find a lot of uses for that. I think it will do the job.
teabag, I'm always trying to find time to better my Python, but everything is always too urgent. Thanks for the suggestion though.
posted by Just this guy, y'know at 8:28 AM on May 4, 2012
You can use jython within google refine. It is an enormously useful tool. Also, join the google refine google group- the people on it are incredibly helpful folks.
posted by rockindata at 11:23 AM on May 4, 2012 [1 favorite]
posted by rockindata at 11:23 AM on May 4, 2012 [1 favorite]
This thread is closed to new comments.
posted by soelo at 7:39 AM on May 4, 2012