it's full of tabs?
December 3, 2008 3:23 PM   Subscribe

How do I work with a tab-delimited file when one of the fields is full of tabs?

I have a tab-delimited file that I'm probably going to import into a MySQL database.

Only problem is, one of the fields often has tabs inside the data. It's a field that contained HTML, and the WYSIWYG interface of the originating service seems to have been perfectly happy to let the field be full of tabs.

The field in question is the last field in each row.

There isn't a set number of tabs in the field data; some rows have none, some seem to have dozens.

What can I do?
posted by epersonae to Computers & Internet (20 answers total)
 
So... let me see if I'm understanding correctly. The first n tabs in a given row are delimiters, and any tabs after that are not?

You could copy the data into excel, do a text-to-columns and delimit by tabs, then select all the columns that make up your last field (they should be all the way on the right at this point, starting with the nth column) and merge them again? Or concatenate them with some other character (spaces) in between? Then once you have clean columns, you could easily delimit by anything you want (including tabs).
posted by TSGlenn at 3:33 PM on December 3, 2008


Best answer: You can definitely handle this with a scripting language, if you know python or perl or any such thing, or know someone who does.

Basically, you walk through your file line-by-line. You split the line by tabs into an array. You then concatenate every substring in the array after N-1 into a single string, perhaps with spaces between them, and rebuild the row by concatenating your last-field string with all of the previous array items.

Hopefully someone can re-explain that better or provide pseudocode (or actual code), I would but I'm about to run out the door.
posted by Tomorrowful at 3:43 PM on December 3, 2008


Response by poster: Tomorrowful: I think that's it! (And actually, TSGlenn's answer pointed my brain in that direction, because I know that the file is a PITA to look at in Excel.) I could definitely do this with PHP. Not today, but I can definitely tackle it tomorrow when I'm fresher.
posted by epersonae at 3:51 PM on December 3, 2008


So it's tab data tab data tab tab tab tab?

Import it into MySql and let MySql truncate and warn about the tab-y "fields", or build a table with as many extra dummy rows as there are extra tabs, import into that table, and then in MySql do a create table select columna, columnb, columnc from importtable, omitting the dummy columns from the select list.

Really, it's easier to get it in the database and then do surgery (or let truncation do it for you) than do surgery first.
posted by orthogonality at 3:51 PM on December 3, 2008


orthogonality: the problem is that the number of 'dummy columns' is going to vary per row, so you'd have to do this line by line.
posted by Tomorrowful at 3:58 PM on December 3, 2008


The *real* solution is to have the data source provider convert to pipe-separated fields. Unless you're dealing with pieces of programming code, I've *never* seen a pipe ("|") be a valid character within a data field.

(Which doesn't mean the person generating the file still won't need to escape it some how.)
posted by baylink at 4:13 PM on December 3, 2008


perl -pe 'join "\t", map s/\t/\\\t/g, split "\t", $_, 5' infile.tab > outfile.tab

replacing "5" with the number of columns in the table.
posted by nicwolff at 4:19 PM on December 3, 2008


Oops, no - map doesn't return the resulting string, sorry. So,

perl -pe '@a = split "\t", $_, 5; $a[-1] =~ s/\t/\\\t/g; join "\t", @a;' infile.tab > outfile.tab

again replacing "5" with the number of columns in the table.
posted by nicwolff at 4:23 PM on December 3, 2008


Find and replace two tabs with one. Repeat until no lots of two tabs exist.
posted by b33j at 4:37 PM on December 3, 2008


tomorrowful, erersonae, I was going to suggest the same thing. It it was me, I'd use linux and a quick c program that does the algorithm you suggest, and just pipe the files through it. But that's because it's the only language I know.

What I would do is in the script, convert the file to something-else-delimited, so the actual import goes smoothly.

Since there is only one field that has tabs, and you want to keep those tabs, first, try to make that field the last one in each record/line. Have the script parse the records, outputting a comma, or some fixed number of spaces*, whenever it encounters a lone tab. Suppose each record has 6 fields. Put in logic that says "after you hit 5 delimiters, assume all tabs are part of the last record".

Or, is there by any chance some commonality in the html? Does every html field start with (body) and end with (/body) or something like that? You could have your script recognize those, and then output that and everything inside that as your new field.

Or commonality in the other fields? It's sort of inelegant, but if all the other fields have data that is able to be recognized in some way, you can use that to have your script "recognize" all the other fields and know that everything else is your html. If one field is a phone number, you can tell it that aaa-ppp-nnnn is always one field, or if another field is always going to be one of 8 different values, just program in those values as some kind of compare.

Or go really brute force, and use the script to automate a process where it asks you what each field is. Whenever it sees a tab, it outputs the previous string and asks you what to do with it. You can build in some of the other things, like how it can assume that multiple tabs are part of the html field. And make any other assumptions you can program in to it. Depending on the size of your data, it might be easier to sit there and answer questions for an hour than it would be to get the worlds greatest script working just right. I've done things like this before, and it's not bad. There was some logic puzzle I just couldn't solve programatically, so I just make the program a sort of corrector/formatter that I ran the output through a few times, until everything was correct. Or let the script find patterns for you, and maybe output 5 different files that adhere to a specific pattern, and then you can process those according to their individual pattern.

(This would make a great exam in a programming class. "OK students, here's your computer and your file. Make it work. Whoever gets it done, correctly, in the least time gets the most points."



* Just choose any delimiter that doesn't exist in the data. I'm pretty sure some importers allow you to specify custom delimiters, or even delimiter strings, so that you can say that 10 spaces is your delimiter, or "!@#$%" is your delimiter, so that you can be fairly sure real data isn't mistaken for a delimiter.
posted by gjc at 4:38 PM on December 3, 2008


Find and replace two tabs with one. Repeat until no lots of two tabs exist.

Doesn't quite work, because you might have an odd number of tabs.

nicwolff's solution looks good to my perl-ignorant eye.
posted by Tomorrowful at 5:21 PM on December 3, 2008


excuse me for sounding ignorant, but it sounds like the field originally contained text when it was exported. it's possible that all of your errant tabs are encased in quotes, which may make importing them easier.
posted by lester at 5:58 PM on December 3, 2008


The file format that you're looking for here, to aid in your googling, is known as "ragged right" and more advanced database systems (MS SQL Server Integration Services, for example) can handle it natively. It does precisely what you're looking for.

If you can get your hands on a copy of SSIS or probably any other high-end database, this is a piece of cake.
posted by toomuchpete at 6:27 PM on December 3, 2008


Best answer: $columns = 12;
$data = explode( "\n", file_get_contents( "path/to/file" ) );
foreach( $data as $row )
{
$fields = explode( "\t", $row, $columns );
// $fields is an array with $columns number of elements
// each corresponding to a column of the current row.

// Process as required.
}
posted by axismundi at 7:08 PM on December 3, 2008


o it's tab data tab data tab tab tab tab?

If this is the case, I don't see what your problem is. The tabs are extraneous and you can just leave them hanging off the end as long as you get a return after each line. Am I missing something here?
posted by The Light Fantastic at 9:27 PM on December 3, 2008


>>Find and replace two tabs with one. Repeat until no lots of two tabs exist.
Doesn't quite work, because you might have an odd number of tabs.


Well, try this visualisation if > is tab.

1st run through - Find two tabs, replace with one
abc > abc > abc >>>

2nd run through - Find two tabs, replace with one
abc > abc > abc >>

Final - Find two tabs, replace with one
abc > abc > abc >

Doesn't matter how many you have (odd or even), you end up with only one whereever there was one or many. If you don't like the line ending with a tab, once you have only single tabs left, then find and replace ^t ^p (or equivalent), with ^p (where ^t = tab, and ^p= line end).
posted by b33j at 4:04 AM on December 4, 2008


#!/bin/bash
#assuming there's 5 tabs that count, replace the rest with spaces.
IFS='
'
for line in `cat $1`
do
while ( echo $line | egrep -q '([^ ]* ){6}' ) #those are tabs in the ().
do
line=`echo $line | sed -r 's/(([^\t]*\t){5})(.*)\t/\1\3 /'`
done
echo $line
done
posted by sfenders at 5:28 AM on December 4, 2008


Response by poster: this seems to have done the trick:
$newitems = array();

//open the file 
$items = file('filename.txt');
foreach ($items as $item_num => $item) {
  //explode on \t
  $item = explode("\t",$item);
  //assign first items in the array to a new array
  foreach($item as $cell_num => $cell) {
    switch($cell_num) {
	case 0:
	  $newitems[$item_num][id] = $cell;
	  break;
	case 1:
	  $newitems[$item_num][categoryname] = $cell;
	  break;
	case 2:
	  $newitems[$item_num][categoryname] = $cell;
	  break;

/* huge frakking list of other cells */

        //concatenate all the remaining items into a single field
        //may not need all the tag/char stripping, but I'll figure that out later
	default:
	  $newitems[$item_num][a] .= ' ' . htmlspecialchars(strip_tags($cell));
	};
  };
};
those suggesting combining multiple tabs into a single tab: the tabs are scattered at "random" through the field, not concurrently. Otherwise the field contains text and html (the most important text in the file, btw)

baylink: alas, that's one thing I don't have any control over.

sfenders and axismundi: I just now saw your ideas, both of which also look useful. (OMG, explode has a limit option?! I had no idea.)
posted by epersonae at 8:27 AM on December 4, 2008


Response by poster: and yes, replacing that whole "foreach/case" thing with

$newitems[$item_num] = explode("\t",$item,'12');

worked perfectly. wow. you learn something new every day. :)
posted by epersonae at 8:31 AM on December 4, 2008


Oh, you kids with your fancy scripting languages...

sed 's/"/""/g
s/^/"/
s/$/"/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/
s/\t/","/' </path/to/input/file.tsv >/path/to/output/file.csv

posted by flabdablet at 11:28 PM on December 4, 2008


« Older Importing German video games to the US   |   What are some great ways to passively stay... Newer »
This thread is closed to new comments.