Help w/ importing flatfiles!
January 2, 2007 1:07 PM   Subscribe

Flat file importing best practices? tips? ideas?

Hi, I've been tasked with importing a few dozen flatfiles into our SQL Server database and I am beyond frustrated! If you know of an easier way/tool/or whatever, I would very much appreciate it!

The files are space delimited. Each field is enclosed by double-quote characters.

Problem is: Some of the text fields already contain double-quotes:

e.g. "Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" etc...

e.g.2 "Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" ""Field 5a" "Field5b""

SQL Server 2005's Import/Export Wizard chokes on these lines because the extra quotes throws it off. These files contain around 500,000 lines each and as far as I can tell, there are at least 10% of lines with this problem so this is a hell of a lot of lines to fix manually.

I've tried using a REGEX capable search and replace to try inteligently replaceing " w/ "" (escaped double quotes), but not sure how to reliably catch the extra quotes...

Your help would be greatly appreciated!
posted by apark to Computers & Internet (20 answers total) 1 user marked this as a favorite
 
Use a regex to change to a different delimiter, then get the import wizard to use that delimiter. Often, the pipe ("|") is used.

First make sure that your dat doesn't contain the new delimiter (Search for it, all is good if it's not there.)

Now replace double-quote space double-quote with new-delimiter. Also replace double-quote newline with new-delimiter newline (or in regex, "$).

Now tell the Import Wizard that fields are delimited by new-delimiter, records by newline, and that quotes aren't special.

In other words, from this:
e.g. "Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" etc...

to this:
e.g. Field 1|Field 2|Field 3="Value"|Field 4="Value2" Extra Text|

Note that this won't work properly for your example two.
posted by orthogonality at 1:17 PM on January 2, 2007


Is it a true statement that each of these 'columns' begins with:

"Field, followed by a #, then any text, then a " followed by a space?
posted by SirStan at 1:18 PM on January 2, 2007


#!/usr/bin/perl

while(<STDIN>){
s/\"Field \d+=//g;
s/\"\"/"/g;
print;
}


Bold is input, italic is output

./test.pl
"Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" ""Field 5a" "Field5b""
"Field 1" "Field 2" "Value" "Value2" Extra Text" "Field 5a" "Field5b"

posted by SirStan at 1:27 PM on January 2, 2007


Response by poster: To: orthogonality->Indeed, I wrote a best practices guideline to the effect that field delimiter must be either "|" or and the text delimiter must be "`" (backquote) for future providers.

I've checked all of these files and neither of these two characters occur naturally, however, as you've pointed out, the proposed method would fail in the 2nd example... :(

To: SirStan.> The answer is No. please see field 5 in the 2nd example...

posted by apark at 1:31 PM on January 2, 2007


Does every field end with (character that isnt a space)"
posted by SirStan at 1:36 PM on January 2, 2007


Not knowing perl I can't improve on SirStan's snippet, but I think it is not going to help in the Field 4 case. Depending on how whether the internally quoted data has spaces and quotes, I think that doing a replace of " " with ' ' might be a good answer. If the field data has " " as part of its string, find out how the file was created and look to recreating it in a reasonable fashion.
posted by bastionofsanity at 1:36 PM on January 2, 2007


Well, awk or sed are most likely going to do what you need. I don't have the regex-fu to do the whole thing, but for a start:
dexter:~ cfta$  sed -e 's/=\"[^\"]\"//g' flattest 
"Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text"
"Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" ""Field 5a" "Field5b""
does part of what you wanted. The field 5a/5b thing is out of my (peewee) league, though.
posted by ctmf at 1:40 PM on January 2, 2007


Response by poster: To: bastionofsanity-> You hit the nail on the head. The internally quote data does have spaces AND double-quotes.

So, as in my Example 2
"Field 1" "Field 2" "Field 3="Value"" "Field 4="Value2" Extra Text" ""Field 5a" "Field5b""

Needs to end up in the database as

Name Value
Field1: Field 1
Field2: Field 2
Field3: Field 3="Value"
Field4: Field 4="Value2" Extra Text
Field5: "Field5a" "Field 5b"
posted by apark at 1:41 PM on January 2, 2007


rats. first MeFi post and I screwed it up.
That should have been:
dexter:~ cfta$  sed -e 's/=\"[^\"]*\"//g' flattest 
"Field 1" "Field 2" "Field 3" "Field 4 Extra Text"
"Field 1" "Field 2" "Field 3" "Field 4 Extra Text" ""Field 5a" "Field5b""
Cut and paste error. (sorry)
posted by ctmf at 1:43 PM on January 2, 2007


ctmf: You dropped the data for field3 and field4, but kept the field title.
posted by SirStan at 1:52 PM on January 2, 2007


Well, see? I suck at this. My point was only, this is the tool to use. I was hoping someone that actually knew more than me would jump on it, or that it would help narrow down a google search.

I would like to see the answer. I'm working on it at home now as a fun puzzle.
posted by ctmf at 2:11 PM on January 2, 2007


So, the meat of the problem is not having any definite means of identifying which opening quote is paired with a particular end quote. There are no properties of the format or data that are useful in determining this. I think going back to the source and trying to recreate these files in a sane format is the least painful approach.
posted by bastionofsanity at 2:17 PM on January 2, 2007



#!/usr/bin/perl -nw
use strict;

my $in_field = 0;
my @c = split //;
my $quotedepth = 0;
my (@field, $prev, $cur);
do {
$prev = $cur;
$cur = shift @c;
if ($cur =~ /\s/ and $prev eq '"' and $quotedepth % 2 == 0) {
my $field = join ', @field[1..$#field-1];
$field =~ s/"/\\"/g;
print "\"$field\"";
print " " if @c;
@field = ();
$quotedepth = 0;
}
else {
if ($cur eq '"') {
$quotedepth++;
}
push @field, $cur;
}
} while (@c);
print "\n";


produces

"Field 1" "Field 2" "Field 3=\"Value\"" "Field 4=\"Value2\" Extra Text" "\"Field 5a\" \"Field5b\""

for your second example. It is sensitive to there being exactly one space between fields and it assumes UNIX line endings (but is readily adaptable to DOS line endings.)

I'm not sure a regex could solve this (Perl's regexes being Turing-complete notwithstanding.) At any rate, I find it easier to write a custom micro-parser than to attempt some gargantuan regex for cases like this where you have to track context.

Sorry for the lack of indenting. MeFi kills indenting with pre, but inserts blank lines every other line with code. I thought the former was the lesser evil.
posted by Zed_Lopez at 2:20 PM on January 2, 2007


Response by poster: To: bastionofsanity--> See the thing is, I can figure out what quote goes w/ what field when I eyeball the line in question. So, how do I replicate what goes on in my brain in figuring out how to match quotes into some kind of code?

Alas, re-exporting data in this situation into a sane format is not an option as the source system is defunct... :(

Like I said, in the future, I'm going to refuse to work on any flatfile that is this "dirty," but I need to survive this project first... :(
posted by apark at 2:22 PM on January 2, 2007


Well, that's weird. There should be two single-quotes with nothing between them in line 12, not just a single single-quote. I cut and pasted from my editor and don't know how that happened.
my $field = join ', @field[1..$#field-1];

posted by Zed_Lopez at 2:30 PM on January 2, 2007


OK, now I know how it happened. MeFi eats double single-quotes on posting but not preview.
posted by Zed_Lopez at 2:30 PM on January 2, 2007


panix3% sed -E -e 's/=\"([^\"]*)\"/=\1/g' flatfile
"Field 1" "Field 2" "Field 3=Value" "Field 4=Value2 Extra Text"
"Field 1" "Field 2" "Field 3=Value" "Field 4=Value2 Extra Text" ""Field 5a" "Field 5b""
Still only half the problem. I like Zed_Lopez's perl solution if the database lets you enter escaped quotes, and if you want the quotes to stay.

Thanks for posting your puzzle. I needed an excuse to learn awk and sed.
posted by ctmf at 3:25 PM on January 2, 2007


How did you want the 5a/5b thing to turn out?
"Field 5a Field 5b"
or
"Field 5a" "Field 5b"

I have an answer if you like it the first way. (got it from someone else)
posted by ctmf at 3:57 PM on January 2, 2007



these are assuming that without the "" parse error, everything will import just fine.

vi/vim

:%s/"\+/"/g

//"" (search for doublequotes)

perl:

#!/usr/bin/perl -w

my $filename = "test.txt";

open( FILE, "< $filename ) or die can't open $filename : $!;br> my @f = ;
close FILE;

foreach(@f) {
print $_;
$_ =~ s/"+/"/g;
print $_;
}

perl is very good at dealing with those dirty flatfiles. a little bit of knowledge with regexp, arrays, etc, will go a long way.

okay, here's a slightly modified snippet of that which basically gives you a container full of the elements.

#!/usr/bin/perl -w

my $filename = "test.txt";

# read all of the file into the f array, close the file.
open( FILE, "< $filename ) or die can't open $filename : $!;br> my @f = ;
close FILE;

foreach(@f) {
print $_;

# regexp matching one or more doubleqoutes
$_ =~ s/"+/"/g;

# arrays are fun.
@l = split /" "/, $_;

foreach (@l) {
#strip off that extra doublequote. we dont need it.
$_ =~ s/"//g;
print("$_\n");
}
}

from there, you can.. well.. keep playing. the output from the first is probably what you're looking for in source-outcome format. the output from the second is a list of the elements.

yeahyeah, i shoulda used a hash instead.

posted by onedarkride at 4:01 PM on January 2, 2007


Is there any more structure/repetition/semantic relationships to the file than is indicated beyond what you posed in your original question?

For example, would "Field 1", for example, be drawn from a limited number of words? What about "Field 2" or "Field 3"? Is there one particular field that tends to have polymorphisms (like the field 4 and field 5's in the second example)? Or one particular field that is extremely repetitive? (like, say, M/F for gender)

Aside from things like the split field five, do the lines have the same number of "major fields" each?

Maybe if the data is structured enough and has obvious (to a human) demarcations you could spit it out to the "mechanical Turk". (which is an Amazon service for outsourcing a large number of tiny jobs. At $0.01 per line * 0.1 problematic line ratio * 500k lines = $500; well, I guess that is still really expensive)

Considering the generality of the problem, I tip my hat to all of the great solutions and thinking so far in previous posts.
posted by Maxwell_Smart at 7:27 PM on January 2, 2007


« Older What's the easiest way for my father-in-law to...   |   What can I do to improve my cellphone reception... Newer »
This thread is closed to new comments.