What deep magyck is Excel wielding to make my CSVs work?
May 14, 2009 7:25 AM

This is a long question about CSVs, Excel, Imports and byte-sizes. Here's the rub: I have two completely identical files. They contain exactly the same characters in the same order. One has been opened and saved (Not altered in any way) in Excel. The other has not. The one opened in Excel is 20 bytes larger. The ERP will only accept the one saved in Excel. Why? More inside.

So, Magento, the e-commerce platform, posts two CSVs every time someone places an order: one for customer data, one for order data. Right now, we're just looking at the customer data file.

So, it gets this file #_customer.csv. Our ERP, Sage X3, will not take this file. If I export some customer data from Sage, it gives me a file IDENTICAL in all visible ways to the one I tried to upload.

However, if I take that file, open it in Excel and just save it, suddenly it works. The contents of the file do not change, but it does become 20 bytes larger.

Excel has to be doing something, but what? Does anyone have any experience with this? I originally thought it might be an encoding issue, but I double- and triple-checked: the Excel save is the only variable.

This is everything you need to know, but I'm going to describe a few tests I went through below to get to this point just to be thorough:

The file output, from Magento, looks like this:
A Customer# Order# Category
B First Last Address1 Address 2 Address 3 State Zip Country
D SKU Description Quantity
D SKU Description Quantity
etc etc.

If I tried to feed that to Sage (which would export a file exactly like this), Sage would not take it. If I saved it in Excel, Excel would add a number of empty junk fields so that the number of fields in every row equaled the number of fields in the largest row MINUS one. In our case, the largest row has TEN fields, so the first row would get 5 extra, empty fields (,,,,, etc.) for a total of NINE fields. The D rows would get 5, too; a row with 3 would get 6, etc.

So, I thought, "Well, that's weird, but whatever. I'll just output x # of junk fields in my CSV". And so I did that. And I got a file visually identical to the file that Sage would accept, thinking that the problem was these junk fields. Still, Sage wouldn't accept it. I had to open it, Save it in Excel and then upload it. It adds about 20 bytes to the file and it works.
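For what it's worth, that padding step is easy to automate. Here's a rough Python sketch (the function and file names are mine, not anything from Magento or Sage) that pads every row out to the width of the widest row — note it pads to the full width, not the odd width-minus-one behavior described above:

```python
import csv

def pad_csv(src, dst):
    """Pad every row of src out to the widest row's field count,
    roughly mimicking what Excel does on save, and write to dst."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    width = max(len(r) for r in rows)
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        for r in rows:
            # Append empty fields until the row is `width` long.
            writer.writerow(r + [""] * (width - len(r)))
```

This is only a sketch of the padding idea; as the thread goes on to show, the padding turned out not to be the real problem.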

The common denominator between these tests: Open and Saving in Excel, adding the 20 bytes. What is Excel up to? I have driven myself to the brink of insanity. Because this needs to be an automated process, I can't open and Save hundreds of files a day. Is this an encoding thing? Help! Save me from CSV hell!
posted by GilloD to Technology (34 answers total)
 
Response by poster: FYI: This one DOES NOT work:

B,CON, ,BIL,NTX,CRC,MAIL
A,BIL,Billing Address,85 North Street, ,Scott,Billyburg,New Jersey,07223,US
A,SHI,Shipping Address,85 North Street, ,Scott,Billyburg,New Jersey,07223,US
D,BIL,Scott
D,SHI,Scott

This one DOES work:

B,CON,,BIL,NTX,CRC,MAIL
A,BIL,Billing Address,85 North Street, ,Scott,Billyburg,New Jersey,07223,US
A,SHI,Shipping Address,85 North Street, ,Scott,Billyburg,New Jersey,07223,US
D,BIL,Scott
D,SHI,Scott

I'm using Notepad++ to view them in Text.
posted by GilloD at 7:29 AM on May 14, 2009


Response by poster: It should also be noted that the ERP hates anything with "Quotes" around it.
posted by GilloD at 7:30 AM on May 14, 2009


You need to 'diff' these two files against one another. Google 'diff' and your OS.
posted by unixrat at 7:33 AM on May 14, 2009


if you could post the actual files (with dummy data of course) we could figure out what & where those extra 20 bytes are.
posted by polyglot at 7:33 AM on May 14, 2009


Best answer: Well, the only difference in the text between the two examples you post is that the second one does not have a space for the empty value between CON and BIL.

A unix utility like "diff" would be very useful here to determine exactly what the difference between the files is. Can you post the example files somewhere (perhaps using senduit.com or similar) so that we could examine exactly what is in those 20 bytes?
posted by yourcelf at 7:33 AM on May 14, 2009


haha, preview. hrmph.
posted by yourcelf at 7:34 AM on May 14, 2009


Response by poster: The field is a customer # that gets assigned by the ERP. If I fill it in, it'll "Update" the record. It's fine to leave it blank. I'll post the files in a few.
posted by GilloD at 7:36 AM on May 14, 2009


Best answer: I'm a little confused by the headers. Why exactly is one of the column names blank?
It looks like in the version that works nothing is there, so maybe the import function is ignoring the second comma and treating BIL as the third column. In the one that doesn't work it's trying to create a column with the name ' ' which is certain not to work.

If there is any way to get that third column to have an actual value in it (not an empty space) that might improve things.

In theory, on a computer if two files really are identical they should be treated identically. You say that the customer data exported from Sage is identical to #_customer.csv. The truth is, looking at it in Notepad is not sufficient. There is other data that may be ignored or removed by a text editor.

The best thing to do would be to get a Hex editor and open the two files. This will show you ALL of the contents of the files, including things like null characters or control characters.

Diff may work as well, but the advantage of a hex editor is you can see *exactly* what characters the other file has that yours doesn't.
posted by Deathalicious at 7:37 AM on May 14, 2009


I'm guessing the line-ending codes are changed from one 'character' to two. You cannot see them in a normal text-editor like notepad. Run a diff, like windiff.
posted by Psychnic at 7:38 AM on May 14, 2009


Any chance that the file is 20 lines long, and that Excel is converting the end-of-line marker from a newline to a carriage-return + newline? That's a common issue with files generated on some flavor of Unix and consumed on some flavor of Windows.
posted by dws at 7:40 AM on May 14, 2009
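That hypothesis is easy to check without a hex editor. This is a quick Python sketch (not one of the tools mentioned in the thread; the path is a placeholder) that counts each line-ending convention in a file's raw bytes:

```python
def line_endings(path):
    """Count CRLF, bare CR, and bare LF sequences in a file."""
    with open(path, "rb") as f:
        data = f.read()
    crlf = data.count(b"\r\n")
    # Subtract the CRLF pairs so they aren't double-counted
    # as bare CRs and bare LFs.
    return {"CRLF": crlf,
            "CR": data.count(b"\r") - crlf,
            "LF": data.count(b"\n") - crlf}
```

Run it on both files; whichever key is nonzero tells you which convention each file uses.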


Best answer: You cannot see them in a normal text-editor like notepad

But you can see them in Notepad++. In the View menu, click on "Show End of Line", and it will put CR or CRLF at the end of each line. I also suggest "Show White Space and TAB" to see the other invisible characters that might be different.
posted by smackfu at 7:43 AM on May 14, 2009


Best answer: Maybe it's in the way the files are saved - Unix format vs. Windows/DOS format. IIRC, Unix ends lines with a LF, while Windows/DOS ends lines with a CR & LF. Most modern viewers auto-detect the format, so they look the same, but they aren't. Excel might be converting from Unix format to Windows/DOS format. The extra CRs are a byte each, so you'd have 20 more bytes for 20 lines of data.

You can confirm by opening the files with Windows' Notepad, which is dumb enough to not automatically convert. If your original file looks weird (i.e.: all crammed into what looks like one line), you know that it's saved with LFs only at the end of the lines.

In Notepad++, you can probably check for the option to open/save as Unix vs. Windows/DOS format.
posted by Simon Barclay at 7:43 AM on May 14, 2009


Excel adds crlfs to line breaks
posted by jeb at 7:44 AM on May 14, 2009


Response by poster: http://asilenttreatment.org/sageexport/ForMeFi/59_customer.csv Does NOT work
http://asilenttreatment.org/sageexport/ForMeFi/59_customerSAVEDWITHEXCEL.csv THIS WORKS (You'll note it lobs the 0's on the front of the Zip Code. This is okay for testing.)
http://asilenttreatment.org/sageexport/ForMeFi/FromSageAGAIN.csv (From the ERP, works. You'll note some minor discrepancies, like "United States" instead of "US"; again, these don't matter.)
posted by GilloD at 7:44 AM on May 14, 2009


Response by poster: Also, none of that data in those files is actual, so, no, you cannot break into my house and STEAL MY BRAIN!

Anyway, a ton of great responses in here. I'm going to start at the top and work my way down until something works. Thank you very, very much.
posted by GilloD at 7:45 AM on May 14, 2009


You cannot see them in a normal text-editor like notepad. Run a diff, like windiff.

Actually, in my experience Notepad (not Notepad++) actually goes a little haywire when it opens files that don't have the right kind of line breaks. So if you opened it in normal Notepad rather than Notepad++ (which probably detects line endings) you might notice one looks different than the other.

As an aside, how would you use diff to detect different line endings? I just did a diff on two files (one with Unix and one with Windows line endings) and got the following output:
1,2c1,2
<>
<>
---
> one
> two
Which is not very useful.
posted by Deathalicious at 7:45 AM on May 14, 2009


Response by poster: Also, I have a strong feeling that this has to do with line-ending characters, I just wasn't sure what to look for.
posted by GilloD at 7:46 AM on May 14, 2009


Should have previewed...
posted by Simon Barclay at 7:48 AM on May 14, 2009


Well, diff is not really going to work if the data in the files (ERP vs export) aren't the same.

One thing that jumps to the fore: the export file that doesn't work doesn't have a newline at the end of the file. That might be the cause of the error.

If so, that is definitely something that Magento should fix and you should contact them about that.
posted by Deathalicious at 7:50 AM on May 14, 2009


Diff is pretty hard to understand for non-programmers. There are a bunch of free versions of Unix's 'od' (octal dump) command floating around -- search for 'octal dump windows' in google.

od has some slightly arcane output, but all you really need to know is that it's showing you the ASCII codes for each byte in the file. From there it is a matter of poring through the files to determine what additional codes there are, and what they mean. My money is on the file being 20 lines long, and there being a CR LF at each line end rather than just a CR. So all 015s are being changed to 015 012 (see http://www.asciitable.com).
posted by felix at 7:55 AM on May 14, 2009
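If you'd rather not hunt down a Windows port of od, the same idea fits in a few lines of Python (this is my own sketch, not any of the tools named here): print each byte as a three-digit octal ASCII code, so 015 (CR) and 012 (LF) jump out.

```python
def octal_dump(data, width=16):
    """Render bytes as space-separated 3-digit octal codes,
    `width` bytes per line, like a stripped-down `od`."""
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        lines.append(" ".join(f"{b:03o}" for b in chunk))
    return "\n".join(lines)
```

For example, `octal_dump(b"A\r\nB")` shows `101 015 012 102`, making the CR LF pair after the `A` visible.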


Response by poster: Well, diff is not really going to work if the data in the files (ERP vs export) aren't the same.

One thing that jumps to the fore: the export file that doesn't work doesn't have a newline at the end of the file. That might be the cause of the error.

If so, that is definitely something that Magento should fix and you should contact them about that.


It's actually popping out via an XSL template. Can I just drop the return at the end of the line there?
posted by GilloD at 8:02 AM on May 14, 2009


Response by poster: Of note: Notepad++ notes that MY version has a "CR" return, the working one has an "LF" return. So, the question is- How do I fix this? I can control everything in the output from Magento, do I just need to drop some ASCII in there?

Now I remember why I dropped CompSci for Philosophy.
posted by GilloD at 8:07 AM on May 14, 2009


Best answer: felix is correct - in the non working version, each line ends with 0D or CR. In the working version, each line ends with 0D 0A or CR LF. I viewed both files in Hex Edit mode in UltraEdit, and that's the answer.
posted by syzygy at 8:11 AM on May 14, 2009


It definitely is a newline problem. The only differences between those files are:

CR newline on the non-working file
CR LF newline on the working file

Also, Excel has helpfully removed the leading 0 from your zip code.

You probably need to add a \r\n in place of the \r in the output template.
posted by Nothing at 8:13 AM on May 14, 2009
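If the template itself can't be changed, that same \r-to-\r\n swap can be done as a post-processing step. A minimal Python sketch (file paths are placeholders), working in binary mode so nothing else gets touched:

```python
def cr_to_crlf(src, dst):
    """Rewrite src with CR LF line endings, whatever it had before."""
    with open(src, "rb") as f:
        data = f.read()
    # Normalize any existing CRLF or bare CR to LF first,
    # so CRs don't get doubled on the next step.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst, "wb") as f:
        f.write(data.replace(b"\n", b"\r\n"))
```

This could be dropped into whatever automation moves the files from Magento to Sage.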


Response by poster: Yep- Just did my own hex dump (Home sweet home? Kind of. More like you can never go home again. Anyway-) and saw the 0d/0a thing. In the XSL template I just had it dropping a bare carriage return (&#13;) which is probably causing all kinds of trouble.

I'm still not 100% sure what the answer is here.
posted by GilloD at 8:17 AM on May 14, 2009


Response by poster: Or, rather, I know what the answer is, not how to implement it via the XSL template.
posted by GilloD at 8:19 AM on May 14, 2009


Not an XSL expert, but I believe XSL leaves line endings up to the operating system. You might write a very quick script, or download a text converter or something, to act as an intermediate step; there are many, many examples all over the internet for how to do this in your favorite language as it's a very common problem.

Or, if you were sufficiently lazy, you could download Unix2Dos from any number of locations -- for instance, there appears to be an implementation at http://www.bastet.com/.
posted by felix at 8:33 AM on May 14, 2009


I believe you can control the newlines output from XSL using <xsl:output encoding="(something)"> node, though I don't recall off the top of my head what encoding you'd want.
posted by ook at 8:39 AM on May 14, 2009
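I'm not sure encoding is the right knob; as an untested alternative, XSLT will emit whatever characters you put inside an xsl:text node, so writing the CR LF pair explicitly at the end of each row should work (this fragment is a sketch, not the poster's actual template):

```xml
<!-- Inside the template that writes one CSV row: emit an explicit
     CR LF pair instead of relying on the processor's default. -->
<xsl:text>&#13;&#10;</xsl:text>
```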


GilloD: Also, I have a strong feeling that this has to do with line-ending characters, I just wan't sure what to look for.

For those playing at home who'd rather not wade through the Wikipedia article, and for simplicity and brevity's sake, a small summary:

Plain text files (of which CSV files are one type) are identical on Unix-like and PC systems, with one glaring difference¹: the end of every line in a Unix plain text file is marked by a newline (or linefeed, represented, for example, in Notepad++ as LF) character, whereas the end of every line in a PC plain text file is marked by a return (or carriage return, represented CR) character and a newline character². Given the commonness of Unix-like systems, especially in things like database functions, .csv files often get switched back and forth between these two different types; it's simple enough to automatically replace every LF with CR LF—and it only adds a few bytes to the file—but sometimes people (and even computer programs) forget. This causes some incompatibility, and sometimes programs can't recognize one or the other type of file.

For what it's worth, I've had some luck with this tiny, simple context-menu-based converter.

__________________________________________________________
¹ ...and until OS X, Mac-style plain text files all had only carriage returns. I'm telling you, sometimes this cross-OS weirdness is just stupid.

² This dates back to around 1972, when, during the old ARPAnet days, the CR LF standard was agreed upon. See here.

posted by koeselitz at 8:40 AM on May 14, 2009 [1 favorite]


GilloD: Of note: Notepad++ notes that MY version has a "CR" return, the working one has an "LF" return. So, the question is- How do I fix this? I can control everything in the output from Magento, do I just need to drop some ASCII in there?

Starting simply, you can fix this within Notepad++ by doing this:

  • (If you haven't selected “View”→“Show All Characters” yet, you should, so that you can see what's happening.)
  • Bring up the “Find and Replace” dialog by pressing control-h.
  • Select “Extended.” This allows you to use simple symbols for some characters that don't have their own keys, like LF and CR.
  • Replace whatever you'd like, keeping in mind these symbols:

    CR = \r
    LF = \n

  • For example, if you want to make this work in Sage by removing all the CRs, just put \r in the “Find what:” field, leave the “Replace with:” field blank, and click the “Replace all” button. Or, to convert back, replace all \n with \r\n. Simple.

posted by koeselitz at 8:52 AM on May 14, 2009


Response by poster: The problem is that this needs to be happening AUTOMATICALLY. There will be several hundred of these a day.

Anyway- Not really a solution, but I'm being told by our ERP tech dude that I can swap the expected line-ending character in the ERP itself.
posted by GilloD at 8:55 AM on May 14, 2009


Best answer: SO, A RECAP:

The two files were visually identical. However, on a hex/encoding level, they were not.

Basically, my files from Magento were coming out with bare CR linebreaks. If I opened them with Excel, Excel said "CR? Man, we use CR LF around here" and changed the linebreaks to CR LF.

Sage said, "CR? WTF? I only know CR LF". Hence the problem. The solution? I was able to tell Sage to expect "CR" and not "CR LF".

Many, many thanks to the EggHeads at AskMefi. I used the following stuff to get to the bottom of this:

Notepad++ for finding out what the linebreak characters in the file were.
Free Hex Editor Neo to discover their hex values (0D vs. 0A, CR vs. LF).
Asciitable.com to discover that those are 13 and 10 in decimal, respectively.

Problem solved. CASE CLOSED.
posted by GilloD at 9:15 AM on May 14, 2009 [1 favorite]


Thanks for the wrap-up! I wish people would do that more regularly.
posted by McBearclaw at 9:33 AM on May 14, 2009


Glad to hear it all got worked out.

I don't know if you were hoping to batch-convert—that is, if you wanted to do several hundred conversions once a day—or if you wanted this to happen automatically, but I was bored at lunch, so I whipped up this handy little program that batch-converts files between LF-line-terminated, CR, CR LF, and even LF CR (if that's something you'd ever want). I haven't tested the upper limits, but I imagine you can convert something like a few thousand files the size of the one you've shown us at a time; just select them all, and it'll put the converted files in a new folder called "converted." I wrote it in AutoHotkey (still learning Python, or else it'd have been a lot easier, heh). Here's the source code.

I know this isn't likely to be any help to you, since you've already figured it all out, but hey, maybe it'll help somebody else. Anyhow, there you go.
posted by koeselitz at 1:24 PM on May 14, 2009
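In the same spirit, a minimal Python version of that kind of batch converter might look like this (the function name, folder name, and supported endings are my own choices, not taken from the AutoHotkey program):

```python
import os

# Supported target line-ending sequences.
ENDINGS = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n", "LFCR": b"\n\r"}

def batch_convert(paths, ending="CRLF", outdir="converted"):
    """Convert each file's line endings and write the results
    into `outdir`, leaving the originals untouched."""
    os.makedirs(outdir, exist_ok=True)
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        # Normalize every known ending to a lone LF first
        # (two-byte sequences before the bare CR, to avoid doubling).
        for seq in (b"\r\n", b"\n\r", b"\r"):
            data = data.replace(seq, b"\n")
        data = data.replace(b"\n", ENDINGS[ending])
        with open(os.path.join(outdir, os.path.basename(path)), "wb") as f:
            f.write(data)
```

Pointed at a folder of Magento exports once a day, this would cover the "several hundred files" case mentioned above.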


This thread is closed to new comments.