Excel delimited text import FAIL
February 25, 2009 1:59 PM   Subscribe

Excel is chapping my hide. I have a comma-delimited .csv which contains quote-demarcated text fields. The source of my never-ending joy is the recurrent use of inch-marks and commas within the textfields. Right this way...

EXAMPLE

TEXTFIELD1, NUMFIELD, TEXTFIELD2
"3" widget, black",1,"3" widget, black"
"Assorted widgets, 3"",1,"Assorted widgets, 3""

this renders in Excel as follows after careful use of the text-import wizard (\t represents column breaks, \n line ends)


TEXTFIELD1\tNUMFIELD\tTEXTFIELD2\n
3 widget, black"\t1\t3 widget, black"\n
Assorted widgets, 3",1,Assorted widgets, 3""\n

The semi-random removal of in-text-field quotes is repeatable, so there are actual rules being applied.

Excel CAN'T be this brain dead, can it?

I mean, .csv and comma-delimited are OOOOOLD standards. Freeform text fields containing inch-marked and comma-ed product data have to be the very oldest text-and-data management problem in the book.

How can I slap the spreadsheet around to obtain proper data I/O?

PLEASE HOPE ME!
posted by mwhybark to Technology (18 answers total) 1 user marked this as a favorite
 
csv is not actually a standard. to quote the wikipedia article on the subject "No general standard specification for CSV exists". what version of excel are you using? i ask because not all versions handle csv the same way. what happens if you open the file rather than importing it? i have seen the two work flows produce different results in excel 2007.
posted by phil at 2:07 PM on February 25, 2009


adding to my original post. many products require you to wrap columns that contain the delimiter in quotation marks. is it possible that this is your problem? if so i believe you can escape the quotation mark with a \
posted by phil at 2:20 PM on February 25, 2009


I'm pretty sure Excel expects quotes within a quote delimited field to be escaped by placing another quote before it. So instead of a field having

3" widget
it should have
3"" widget

Is the .csv generated so that you can get a fixed copy of it? I'm not sure that Excel will be able to correctly import the version you have, and it might be tough to fix with some sort of text search and replace.
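To make the quote-doubling rule concrete, here is a minimal sketch using Python's standard csv module; the sample strings are just the OP's two rows with the embedded inch marks properly escaped:

```python
import csv
import io

# The OP's rows with the literal inch marks escaped by doubling them,
# per the RFC 4180 convention that Excel also follows:
good = (
    '"3"" widget, black",1,"3"" widget, black"\n'
    '"Assorted widgets, 3""",1,"Assorted widgets, 3"""\n'
)

for row in csv.reader(io.StringIO(good)):
    print(row)
# ['3" widget, black', '1', '3" widget, black']
# ['Assorted widgets, 3"', '1', 'Assorted widgets, 3"']
```

With the quotes doubled, the parser recovers every field intact, inch marks and embedded commas included.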
posted by ghostmanonsecond at 2:28 PM on February 25, 2009


i really wish i could edit past posts. what i was trying to say is that excel may be treating the quotation mark as a special character. it is frequently used to escape data that contains carriage returns or the delimiter. if this is the case you will need to escape the quotes when you want them to be treated as data.
posted by phil at 2:30 PM on February 25, 2009


You've got improper csv.

Try replacing every " that's NOT followed by a comma (that is, all or nearly all your inch marks we hope) with something not in your data like the string BOOGER.

Then replace all BOOGER with "" and try to import that.

This won't catch cases where an inch is followed by a comma. Scan for those and repair by hand.
posted by fleacircus at 2:40 PM on February 25, 2009


I tried creating and importing a csv file using your sample, and by using these steps:

- In step 1, select 'Delimited'
- In step 2, select 'Comma' as the delimiter, and " as the text qualifier
- In step 3, use the defaults

I got this:
TEXTFIELD1\tNUMFIELD\tTEXTFIELD2\n
"3" widget, black"\t1\t"3" widget, black"\n

which I assume is what you're after?

If you want to remove the quotes around "3" widget, black", after doing the above, the easiest way to do this is to use find-replace on black" -> black, and then use the REPLACE worksheet function to get rid of the " at the front of each text string.

The other option would be to open the .csv file in a text editor and remove all the quote marks, then once imported into excel, use CONCATENATE to join 3, ", and widget into a single text field.
posted by girlgenius at 2:49 PM on February 25, 2009


Just so you know, I have the exact same problem with commas in comment text fields. It makes me crazy. Watching.
posted by kimdog at 2:49 PM on February 25, 2009


Just so you know, I have the exact same problem with commas in comment text fields. It makes me crazy. Watching.

Are you enclosing your comment fields in quotation marks (or another data-appropriate mark)? If you are, commas shouldn't be a problem (that's the whole point of enclosing fields in quotation marks, so you can have commas in your fields)
posted by missmagenta at 2:54 PM on February 25, 2009


Response by poster: the data is sourced originally from a product database maintained by non-technical people. it was imported into an inventory-management product. the original database includes a column for manufacturer partnumber which must be manually joined to the imported data. The datafile I am working with actually originates as a csv export from the inventory management app.

the inventory management product uses TEXTFIELD1 as the primary key and will be synching back to the original product database; therefore the data in TEXTFIELD1 can't be munged to correct the use of " and ,

I have my own thoughts about using a free-form text-based field as a primary key, but that has nothing at all to do with this particular problem.

the product developers have provided a 'supported' workflow to join the fields based on reimporting, you guessed it, via comma-quote delimited csv ("open the exported CSV with Excel," the instructions begin). So I suspect that even should I get the file set up properly, importing it will blow up. But: first things first.

girlgenius, doing this does retain the "" pairings. But it created new exception records as well.

fleacircus et al, enforcing the doublequotes seems promising, I will take a look at that.

phil, I am using Excel 2007. "Open"-ing the file does not invoke the text-import wizard. "Import"-ing the file does, and it seems to be the same text-import wizard I have come to know and loathe lo these many years. It is certainly better than just "open," though.
posted by mwhybark at 3:53 PM on February 25, 2009


I'm skeptical that you will get Excel to handle this without munging the data. The reason is that I don't see how Excel is being braindead. The only way to tell in your example that

"3" widget, black",

should be one column containing the text 3" widget, black [column break] followed by the next column is by context, by understanding that "three-inch widget, black", makes semantic sense and no other reading of these symbols does. Excel needs the double quotes to escape the commas in single columns, and it needs the commas to break up the columns. There doesn't seem to be any reasonable rules-based way of handling your problem, as far as I can tell. (I say 'reasonable' because depending on the data, I guess it's conceivable that you could come up with a whole set of rules that would cover all your current cases, like: "ignore closing quote when immediately trailing a digit AND followed by a space". But the general case of having free text fields with unescaped double quotes is not algorithmically solvable, afaict.)

I'm not saying this to be a dick, I'm saying it in the hopes that if you agree with me, you can at least prune "futzing with Excel import tools" as a solution path. Basically, if you have unescaped delimiters in your file, there is no way for any tool to know what is a delimiter and what is an unescaped data value.

How about this idea: take your file, append a new, temporary, numeric primary key. Save a file that is a proper CSV mapping the new numeric primary key to the bogus TEXTFIELD1 primary key. Now, munge TEXTFIELD1 to be a sane useable column using one of the approaches mentioned above. Do whatever you have to do in Excel or with your join or whatever. When you are done, drop your munged TEXTFIELD1 column, use the temporary numeric primary key and the mapping file to join back on the original, insane TEXTFIELD1, drop the temporary numeric primary key. Done.
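The surrogate-key detour above can be sketched in a few lines of Python (a hypothetical illustration; the raw strings below simply stand in for lines of the export file, and the whole line stands in for the bogus TEXTFIELD1 key):

```python
# Sketch of the surrogate-key workaround: tag each raw line with a numeric
# id, do all the cleanup/join work keyed on the id, then swap the original
# (untouched) key values back in at the end.
raw_lines = [
    '"3" widget, black",1,"3" widget, black"',
    '"Assorted widgets, 3"",1,"Assorted widgets, 3""',
]

# the "mapping file": numeric id -> original, unmunged primary key
key_map = {i: line for i, line in enumerate(raw_lines, start=1)}

# ... munge TEXTFIELD1, import, and join in Excel using only the id ...

# when done, restore the original key values exactly as they were
restored = [key_map[i] for i in sorted(key_map)]
assert restored == raw_lines
```

The point of the mapping is that the insane original keys never pass through Excel's parser at all, so nothing can mangle them.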
posted by jeb at 4:30 PM on February 25, 2009


Wow, it's been a long time since I've had to deal with really crappy CSV files, but everyone above is right - that's really crappy CSV data.

On the off-chance you have access to Perl, Text::xSV was always there for me to help me in my times of crappy-CSV need. It was a long time ago, but I do remember it helping me a lot with problems such as this one.
posted by swngnmonk at 4:41 PM on February 25, 2009


I haven't used Excel in a long time because open source spreadsheet apps have fit my needs. (Which, however, don't usually include printing and print layout - Excel is very good at that and in my experience OS spreadsheet apps aren't.) So my rather crude suggestion is, maybe just download things like Gnumeric and Calc, et cetera, and try opening up your data there, and if it's successful just save it as an Excel file.
posted by XMLicious at 4:56 PM on February 25, 2009


Can you get the inventory app to output its data using some other kind of delimiter?

Are the text fields by any chance equal length?

Ahha! I believe I got it. Open up your source file in notepad and use find and replace to replace all instances of

", with ";
and
," with ;"

Then import the csv using semicolon as your delimiter, and not using quotes as the text qualifier.
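That swap can be sketched in Python (assuming no stray ", or ," sequences occur inside the free text itself, which is exactly the gamble this trick takes):

```python
import csv
import io

raw = '"3" widget, black",1,"3" widget, black"\n'

# turn the quote+comma delimiter pairs into semicolons...
munged = raw.replace('",', '";').replace(',"', ';"')

# ...then split on semicolons with no text qualifier at all
rows = list(csv.reader(io.StringIO(munged), delimiter=';',
                       quoting=csv.QUOTE_NONE))
print(rows)
# [['"3" widget, black"', '1', '"3" widget, black"']]
```

Note the wrapping quotes survive as data, so they still need stripping after the import.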
posted by gjc at 5:27 PM on February 25, 2009


from what i'm seeing here, it looks like the data should have been sent in a fixed field format. It's possible that it was, and that somewhere through its handling, a program has truncated the open spaces. you might want to explore that end. it might have also been tab delimited at one point.

assuming that's a dead end, this is how i'd do it. i use access, though there may be others who could accomplish it in excel. how long it's going to take depends on how many records there are in the data.

i'd import it into access using the comma delimited factor option. i would not flag the quotes as text. when i'd import it, i would allow access to create an indexed autonumber field.

next, I would create a new table with the desired structure, plus an indexed integer field to match the one i created in my data import. I would then run an append query, and insert the autonumber field and only the first field of data from my imported table into the indexed field and textfield1 respectively.

i then would review the data in field 2, and identify some kind of criteria that can help me ascertain whether the data in that field was associated with the textfield1 or numfield. i would then use an update query to modify the textfield1, or to send the data in field 2 that belongs in numfield into the numfield column in my new table.

repeat the last step until you have reassigned all the values from your imported table into your new table. somewhere in the process, i would remove any errant commas or quotes. if your data has a lot of out-of-range values or typos, that will slow this process down, or perhaps render it ineffective.

by the way, if the data is really as simple as your example, it may be possible that your delimiting characters are ", and ,". replacing those two characters with a single new character may also resolve your issues, or at least make the importing method i described above a bit easier.
posted by lester's sock puppet at 5:28 PM on February 25, 2009


Best answer: You know that:
- All double quotes preceded by a comma or at the beginning of a line are text delimiters;
- All double quotes followed by a comma or at end of line are text delimiters; and,
- All other double quotes are not text delimiters.

So, here's something you can try:
- Search and replace all instances of ," with some unique string, say *quote*.
- Search and replace (you will need regular expressions) all instances of " at the beginning of a line with the same unique string.
- Search and replace all instances of ", with the same unique string.
- Search and replace (you will need regular expressions) all instances of " at the end of a line with the same unique string.
- Search and replace all remaining " with a different unique string (say *inches*).
- Search and replace your first unique string with ".
- Import the file into Excel.
- In Excel, search and replace the second unique string with ".

Voila!
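The replace sequence above can be scripted, for example in Python (a sketch: the null-byte-wrapped placeholders stand in for *quote* and *inches*, and as a variation the leftover inch marks are doubled at the end, producing valid RFC 4180 CSV directly, instead of doing the final replace inside Excel):

```python
import re

def fix_csv_line(line: str) -> str:
    # Placeholders standing in for the *quote* and *inches* markers
    DELIM, INCH = '\x00DELIM\x00', '\x00INCH\x00'
    line = re.sub(r'^"', DELIM, line)        # quote at start of line
    line = re.sub(r'"$', DELIM, line)        # quote at end of line
    line = line.replace('",', DELIM + ',')   # quote followed by a comma
    line = line.replace(',"', ',' + DELIM)   # quote preceded by a comma
    line = line.replace('"', INCH)           # anything left is an inch mark
    # Restore the delimiters, and double the inch marks so the line
    # becomes properly escaped CSV that Excel can import directly:
    return line.replace(DELIM, '"').replace(INCH, '""')

print(fix_csv_line('"3" widget, black",1,"3" widget, black"'))
# "3"" widget, black",1,"3"" widget, black"
```

Run over every line of the file, this yields a CSV the text-import wizard will split correctly without any post-import cleanup.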
posted by Simon Barclay at 5:37 PM on February 25, 2009 [1 favorite]


BTW, in my steps 1-6 you can use any text editor that supports regular expressions. In a pinch, Word would probably work, since it supports a variant of regular expressions (you need to click More-Special in the replace dialog) and you don't need to know regexp.
posted by Simon Barclay at 5:40 PM on February 25, 2009


Just to reinforce the point, it isn't Excel that's braindead here, it's the people who made the CSV file for you. The file itself is crap, and that's why your hide is chapped. Get the file in a different format, such as fixed-length fields or semi-colon separated fields, and your problems will vanish. You need to cozy up to the people who made the CSV file and gently explore ways to get their app to output your file in this better way. If they don't understand what you're talking about, you may need to read the documentation for their app and figure out enough about its operation to show them how to do it. (Granted, this is Not Your Job, but if it gets you what you need, and they can do it your way in the future, the quality of your life will be improved.)
posted by exphysicist345 at 7:01 PM on February 25, 2009


Response by poster: "ignore closing quote when immediately trailing a digit AND followed by a space"

I was hoping that I might be able to construct a character-class aware import routine within the text-import wizard, but as noted above, it's the same old same old.

I ended up doing exactly - and i mean exactly - what Simon suggested, with additional exception strings called out such as +quotex+ +commaspace+ and the like.

i ended up with about ten exception records out of about 4000, not terrible, did the cut-and-paste join (well, technically i did a VLOOKUP to throw out any unused mfr PNs) and re-exported from excel as a CSV.

The Excel-generated CSV was sane and ONLY applied quotes to fields with the problem internal commas and inchmarks.

Of course, when i followed the mandated import instructions, the part numbers still didn't populate even though the datafile is now correctly assembled. At least I have a nice big billy club to pound these guys with now. Let's hope they left their pads at home today.
posted by mwhybark at 12:56 PM on February 26, 2009

