How can I separate this data column properly?
September 21, 2012 8:30 AM Subscribe
Question about splitting data into columns: I have one column of data that I'd like to split into seven separate columns. A typical, complete row in the column looks like "mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4" where the number values can be anywhere from 1-6. The problem is that if any one or more of the seven subsites are left unanswered, the entire value is missing (there is no "[subsite]:0" or "," as a placeholder). Consequently, separating the column into seven distinct columns using the comma as the separator causes the data to fall/shift into the wrong output columns if any one or more of the seven subsite categories are missing. How can i fix this?
I am using Google Refine, which is great and can handle regexes. I would prefer NOT to put this into Excel, as I will lose my encoding of special characters in the other columns if I save it out...I'd like to not break up and re-merge the data/columns if at all possible. I also have SQL at my disposal (and am quite familiar with its features; although writing queries from scratch is a bear).
tl;dr: how can insert commas into rows of data in places where a value is expected but not found? Willing to use or tinker with any regex or script that will do the job, but need a starting point at the very least.
Any other ways of tackling this problem would be great. This is 2,000 rows of data, btw.
I am using Google Refine, which is great and can handle regexes. I would prefer NOT to put this into Excel, as I will lose my encoding of special characters in the other columns if I save it out...I'd like to not break up and re-merge the data/columns if at all possible. I also have SQL at my disposal (and am quite familiar with its features; although writing queries from scratch is a bear).
tl;dr: how can insert commas into rows of data in places where a value is expected but not found? Willing to use or tinker with any regex or script that will do the job, but need a starting point at the very least.
Any other ways of tackling this problem would be great. This is 2,000 rows of data, btw.
#!/bin/perl
@cols = {'mefi', 'ask', 'projects', 'jobs', 'music', 'irl'};
while (<>) {
$pos = 0;
foreach $val ( split /,/, $_ ) {
if ( $val =~ m/@cols[$pos]:(\d+)/ ) {
print $1;
} else {
print 0;
}
print ",";
}
print "\n";
}
>
Wholly untested - written in browser. Could probably be done more succintly, but should work.
posted by jferg at 9:33 AM on September 21, 2012
I'm a bit confused. This single column of data is in Google Refine? Why can't you export it as CSV, then re-import it as CSV (in which case it should recognize the commas and add blank values as appropriate)? If it does skip the empty values, then jferg's script looks reasonable to apply to a CSV file (except @cols is missing 'meta').
posted by A dead Quaker at 9:46 AM on September 21, 2012 [1 favorite]
posted by A dead Quaker at 9:46 AM on September 21, 2012 [1 favorite]
If you do end up using Excel, the following formulas will extract the value for each site into its own column. If a value is missing, the formulas will return a zero.
Assuming your data starts in cell A2.
A2: "mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4"
B2: =VALUE(IF(ISERROR(FIND("mefi:",A2)),0,MID(A2,FIND("mefi:",A2)+LEN("mefi:"),1)))
C2: =VALUE(IF(ISERROR(FIND("meta:",A2)),0,MID(A2,FIND("meta:",A2)+LEN("meta:"),1)))
D2: =VALUE(IF(ISERROR(FIND("ask:",A2)),0,MID(A2,FIND("ask:",A2)+LEN("ask:"),1)))
E2: =VALUE(IF(ISERROR(FIND("irl:",A2)),0,MID(A2,FIND("irl:",A2)+LEN("irl:"),1)))
F2: =VALUE(IF(ISERROR(FIND("projects:",A2)),0,MID(A2,FIND("projects:",A2)+LEN("projects:"),1)))
G2: =VALUE(IF(ISERROR(FIND("music:",A2)),0,MID(A2,FIND("music:",A2)+LEN("music:"),1)))
H2: =VALUE(IF(ISERROR(FIND("jobs:",A2)),0,MID(A2,FIND("jobs:",A2)+LEN("jobs:"),1)))
posted by Jasper Friendly Bear at 10:04 AM on September 21, 2012
Assuming your data starts in cell A2.
A2: "mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4"
B2: =VALUE(IF(ISERROR(FIND("mefi:",A2)),0,MID(A2,FIND("mefi:",A2)+LEN("mefi:"),1)))
C2: =VALUE(IF(ISERROR(FIND("meta:",A2)),0,MID(A2,FIND("meta:",A2)+LEN("meta:"),1)))
D2: =VALUE(IF(ISERROR(FIND("ask:",A2)),0,MID(A2,FIND("ask:",A2)+LEN("ask:"),1)))
E2: =VALUE(IF(ISERROR(FIND("irl:",A2)),0,MID(A2,FIND("irl:",A2)+LEN("irl:"),1)))
F2: =VALUE(IF(ISERROR(FIND("projects:",A2)),0,MID(A2,FIND("projects:",A2)+LEN("projects:"),1)))
G2: =VALUE(IF(ISERROR(FIND("music:",A2)),0,MID(A2,FIND("music:",A2)+LEN("music:"),1)))
H2: =VALUE(IF(ISERROR(FIND("jobs:",A2)),0,MID(A2,FIND("jobs:",A2)+LEN("jobs:"),1)))
posted by Jasper Friendly Bear at 10:04 AM on September 21, 2012
Why will you lose special characters if you bring it into Excel? If you import as fixed-width and put everything in a single column, then take Jasper's approach, I'd think Excel would pass everything untouched?
In my job I've asked our programmers to do this a few times, and they always wrote a quick perl script (a la jferg), although we were working with millions of rows.
posted by RobinFiveWords at 12:33 PM on September 21, 2012
In my job I've asked our programmers to do this a few times, and they always wrote a quick perl script (a la jferg), although we were working with millions of rows.
posted by RobinFiveWords at 12:33 PM on September 21, 2012
Is this what you want? A proper CSV file?
posted by zengargoyle at 8:53 PM on September 21, 2012
#!/usr/bin/perl
use strict;
use warnings;
my @cols = qw( mefi ask meta projects jobs music irl );
# unspecified fields get a zero
my %defaults = map { $_ => 0 } @cols;
# column headers
print join(',', @cols), "\n";
# pick your source
# while (<>) { # read from STDIN
while () { # read from DATA section
chomp;
my %raw = split /[,:]/;
my %fixed = ( %defaults, %raw );
print join(',', @fixed{@cols}), "\n";
}
__DATA__
mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4
mefi:1,meta:2,projects:5,jobs:5,music:6,irl:4
mefi:1,ask:3,meta:2,jobs:5,music:6,irl:4
mefi:1,ask:3,meta:2,projects:5,music:6,irl:4
>
Give output like:
mefi,ask,meta,projects,jobs,music,irl
1,3,2,5,5,6,4
1,0,2,5,5,6,4
1,3,2,0,5,6,4
1,3,2,5,0,6,4
posted by zengargoyle at 8:53 PM on September 21, 2012
God I hate escaping code. That should be:
posted by zengargoyle at 8:58 PM on September 21, 2012
while (<DATA>) { # read from DATA section
posted by zengargoyle at 8:58 PM on September 21, 2012
« Older Help me find a vocal sample from an electro swing... | No lumps, some aches, plenty of fear. Newer »
This thread is closed to new comments.
posted by iamkimiam at 8:38 AM on September 21, 2012