How can I separate this data column properly?
September 21, 2012 8:30 AM   Subscribe

Question about splitting data into columns: I have one column of data that I'd like to split into seven separate columns. A typical, complete row in the column looks like "mefi:1,​ask:3,​meta:2,​projects:5,​jobs:5,​music:6,​irl:4" where the number values can be anywhere from 1-6. The problem is that if any one or more of the seven subsites are left unanswered, the entire value is missing (there is no "[subsite]:0" or "," as a placeholder). Consequently, separating the column into seven distinct columns using the comma as the separator causes the data to fall/shift into the wrong output columns if any one or more of the seven subsite categories are missing. How can i fix this?

I am using Google Refine, which is great and can handle regexes. I would prefer NOT to put this into Excel, as I will lose my encoding of special characters in the other columns if I save it out...I'd like to not break up and re-merge the data/columns if at all possible. I also have SQL at my disposal (and am quite familiar with its features; although writing queries from scratch is a bear).

tl;dr: how can insert commas into rows of data in places where a value is expected but not found? Willing to use or tinker with any regex or script that will do the job, but need a starting point at the very least.

Any other ways of tackling this problem would be great. This is 2,000 rows of data, btw.
posted by iamkimiam to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
Response by poster: I suppose another way to go about this would be to write a script that means "If you see "[subsite]:[number]" then copy that section into column X. If that doesn't exist, put "0" in column X instead." If somebody could help me write that script, that'd be great.
posted by iamkimiam at 8:38 AM on September 21, 2012



#!/bin/perl

@cols = {'mefi', 'ask', 'projects', 'jobs', 'music', 'irl'};
while (<>) {
$pos = 0;
foreach $val ( split /,/, $_ ) {
if ( $val =~ m/@cols[$pos]:(\d+)/ ) {
print $1;
} else {
print 0;
}
print ",";
}
print "\n";
}

Wholly untested - written in browser. Could probably be done more succintly, but should work.
posted by jferg at 9:33 AM on September 21, 2012


I'm a bit confused. This single column of data is in Google Refine? Why can't you export it as CSV, then re-import it as CSV (in which case it should recognize the commas and add blank values as appropriate)? If it does skip the empty values, then jferg's script looks reasonable to apply to a CSV file (except @cols is missing 'meta').
posted by A dead Quaker at 9:46 AM on September 21, 2012 [1 favorite]


If you do end up using Excel, the following formulas will extract the value for each site into its own column. If a value is missing, the formulas will return a zero.

Assuming your data starts in cell A2.

A2: "mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4"
B2: =VALUE(IF(ISERROR(FIND("mefi:",A2)),0,MID(A2,FIND("mefi:",A2)+LEN("mefi:"),1)))
C2: =VALUE(IF(ISERROR(FIND("meta:",A2)),0,MID(A2,FIND("meta:",A2)+LEN("meta:"),1)))
D2: =VALUE(IF(ISERROR(FIND("ask:",A2)),0,MID(A2,FIND("ask:",A2)+LEN("ask:"),1)))
E2: =VALUE(IF(ISERROR(FIND("irl:",A2)),0,MID(A2,FIND("irl:",A2)+LEN("irl:"),1)))
F2: =VALUE(IF(ISERROR(FIND("projects:",A2)),0,MID(A2,FIND("projects:",A2)+LEN("projects:"),1)))
G2: =VALUE(IF(ISERROR(FIND("music:",A2)),0,MID(A2,FIND("music:",A2)+LEN("music:"),1)))
H2: =VALUE(IF(ISERROR(FIND("jobs:",A2)),0,MID(A2,FIND("jobs:",A2)+LEN("jobs:"),1)))
posted by Jasper Friendly Bear at 10:04 AM on September 21, 2012


Why will you lose special characters if you bring it into Excel? If you import as fixed-width and put everything in a single column, then take Jasper's approach, I'd think Excel would pass everything untouched?

In my job I've asked our programmers to do this a few times, and they always wrote a quick perl script (a la jferg), although we were working with millions of rows.
posted by RobinFiveWords at 12:33 PM on September 21, 2012


Is this what you want? A proper CSV file?

#!/usr/bin/perl
use strict;
use warnings;

my @cols = qw( mefi ask meta projects jobs music irl );

# unspecified fields get a zero
my %defaults = map { $_ => 0 } @cols;

# column headers
print join(',', @cols), "\n";

# pick your source
# while (<>) {  # read from STDIN
while () {  # read from DATA section
    chomp;
    my %raw = split /[,:]/;
    my %fixed = ( %defaults, %raw );
    print join(',', @fixed{@cols}), "\n";
}

__DATA__
mefi:1,ask:3,meta:2,projects:5,jobs:5,music:6,irl:4
mefi:1,meta:2,projects:5,jobs:5,music:6,irl:4
mefi:1,ask:3,meta:2,jobs:5,music:6,irl:4
mefi:1,ask:3,meta:2,projects:5,music:6,irl:4
Give output like:

mefi,ask,meta,projects,jobs,music,irl
1,3,2,5,5,6,4
1,0,2,5,5,6,4
1,3,2,0,5,6,4
1,3,2,5,0,6,4

posted by zengargoyle at 8:53 PM on September 21, 2012


God I hate escaping code. That should be:
while (<DATA>) {  # read from DATA section

posted by zengargoyle at 8:58 PM on September 21, 2012


« Older Help me find a vocal sample from an electro swing...   |   No lumps, some aches, plenty of fear. Newer »
This thread is closed to new comments.