Why is Excel so bad at reading CSV?
April 21, 2022 8:45 AM   Subscribe

Anyone who has had to read data know that it's default behaviour is extremely obnoxious. It tries to convert everything into what it thinks the user might want, destroying information in the process. But... why?

This is something I've just treated as a fact for a while, and I'm fully aware of work arounds. But the default behaviour seems so transparently bad to me.

It seems like in most cases, when I have a CSV with something which might be a date, but could be a fraction, or a range, or any number of things, I should leave it in it's raw form, rather than make a decision for a user. Am I wrong?

Do some companies work flows really rely on Excel's bad behaviours? I can only assume why they have kept doing this for the last decade; and google sheets behave in the same way too!
posted by Cannon Fodder to Computers & Internet (21 answers total) 4 users marked this as a favorite
 
I've been a contract database/ETL developer for 24 years and I have never worked for a company that relied upon the aggravating Excel behaviors that you describe. Excel may be fine for front-end user processing but it destroys raw data and I avoid it in ETL processes at all costs. I do not have answer as to why it's so bad (like how all long numbers are formatted as scientific notation...WHY?!), except that it's designed more for layout and user interaction and less as an actual data processing tool.
posted by mezzanayne at 8:57 AM on April 21, 2022 [1 favorite]


Bad guessing vs. No guessing is a topic of endless debate for User Experience experts. The most convincing general argument I’ve heard is that the more sophisticated the user is the more they will gravitate towards a manual set up. From that standpoint an app might as well take a shot at automatically sensing all the fields as it will be nice for the grandmas, but force anyone with more complex fields to really spell out what they’re looking for.
posted by Tell Me No Lies at 9:07 AM on April 21, 2022 [4 favorites]


Historically Excel is a spreadsheet program which is designed for manipulating financial data. Therefore it defaults to names, dates, and quantities. It's not supposed to be used as a database and it has been resistant to changes which would make it a better database program (because that's not what it is). But as part of a bundle it is essentially free so like other Microsoft products it is used widely outside it's intended function. It's the same reason PowerPoint sucks for making posters and Word sucks for anything longer than a memo or a letter.

I happen to strongly agree with you that excel should change it's default behavior when it comes to dates but I don't think that is ever going to happen.
posted by muddgirl at 9:08 AM on April 21, 2022 [10 favorites]


The short answer is because it's not a program that's intended to display or manipulate CSV data. It just so happens that it is a convenient way for users to view that data in a program you already have installed and know how to use, so it gets used that way. When you use something in ways its maker didn't intend for you to use it, weird things happen.

My suspicion is that the people in position to change this at Microsoft or Google don't actually feel this as a pain point as much as casual users do because they're aware of (and often users of, and not infrequently developers of) more sophisticated programs to deal with CSV data.

As for your last question, the history of business pretty clearly shows that if an inferior process will save a few dollars, many companies will adopt the inferior process. Excel's CSV handling isn't even a particularly egregious example of this.
posted by kevinbelt at 9:28 AM on April 21, 2022 [1 favorite]


I've always found it annoying when spreadsheets don't detect dates as such on a CSV import.

I think OpenOffice Calc does it right: it gives you the option to detect special numbers right there on the CSV import dialog, and the default is whatever you chose last time. If Excel doesn't do the same thing, then I think that's wrong.
posted by flabdablet at 9:31 AM on April 21, 2022 [2 favorites]


Yeah, this stuff is a big part of my job, and my husband and I often use the shorthand phrase "excel crying" to explain being upset about software.

Excel certainly makes some parts of that process miles easier because I have to do all kinds of data manipulation to these files before dropping them back out as new and better CSV files, and I just accept sometimes having to start over because I didn't notice it ate my invoice numbers or accounting periods as the blood sacrifice this demon requires.

I definitely suspect there's no think-tank among Excel designers focusing on the csv end of the user spectrum, though they did just release some shit that data-scrubbers like me have wept over for years (mmmm TEXTSPLIT and XLOOKUP). I don't think they hate us, they just don't necessarily understand us.
posted by Lyn Never at 10:10 AM on April 21, 2022 [2 favorites]


It’s possible that Microsoft hoped that this would help native Excel formats to replace CSVs. It’s really likely that they wouldn’t mind if it did, and that the Excel team believed that everyone would be better off as skilled Excel users.
posted by clew at 10:10 AM on April 21, 2022


The big company I work for regularly uses CSV sheets into Excel for stuff- there was even a limitation of 64,000 rows (or something it's been a while) that we cut files to work around.
posted by The_Vegetables at 10:12 AM on April 21, 2022


LOL, genetics researchers have renamed genes because Excel kept converting them into dates. Horror Stories. Excel is evil!!!!!!

When Spreadsheets Attack! - YouTube

Evil Spreadsheets I Tell You! Pure Evil!

AAAAAAAARRRRRRRRGGGGGGGHHHHHHHHH!
posted by zengargoyle at 11:13 AM on April 21, 2022 [1 favorite]


Are you double-clicking the CSV file to open it in Excel? If you open a blank Excel spreadsheet, then click Data > From file > From text/CSV and browse to the CSV file, a wizard opens that allows you to set delimiters, field formatting, etc. as you please before displaying the data in the spreadsheet.
posted by Greg_Ace at 12:08 PM on April 21, 2022 [10 favorites]


Great question, which could be rephrased as "Why ... CSV ?"

I didn't learn quite how old, unspecified and repurposed over and over to get jobs done jankily but quickly CSV was, until working in France I encountered .csv files that were invariably semi-colon delimited, because the comma is always used for the decimal point.

I wonder whether the thought has ever been explicitly had at Team Excel that people who have to regularly work with CSV are likely to be the sort of people who'll take perverse satisfaction in knowing how to tricksy away the edge cases ... ?
posted by protorp at 12:33 PM on April 21, 2022 [2 favorites]


CSV's design being "totally text", it doesn't have anything within it to guide the program loading it. You could enforce an organization standard to include everything in quotes, which will treat each 'column' as a quoted string. That should eliminate Excel messing up your fields but you'll have to go and set the formatting on each column after you load it.

The DIF format was an attempt at including _some_ information about fields in the text file, but it is poorly supported if at all.

The Wizard mentioned above is worth trying.

Does anyone know of standards work for file format that would contain self-defining information in some sort of header records, and which would be worked on by all major database / spreadsheet vendors? Similar to calconnect.org which has made great strides on interoperability between calendar software .
posted by TimHare at 1:15 PM on April 21, 2022 [2 favorites]


I'd just been looking for that, TimHare, and was surprised not to find a hundred right off. I suppose they were mostly adhoc or in-house.

There is the 2015 W3C recommendation called "Model for Tabular Data and Metadata on the Web" although it's obviously the continuation of work on CSV files and uses CSV files as examples frequently. (Why not "Character-Delimited Data", since one of the problems is that no single delimiter character makes everyone happy? I expect there is a mile of discussion of this somewhere.)

It includes a datatype-specification section, suggesting that the TSV file embed a link to another file with the datatypes in it. Huh. I'd rather have a header row with the datatypes as fields, so little goblin mungers can still do something with files rescued from the fallen crystal citadels of the High Elves.

I expect there's another mile of discussion there, probably with fewer metaphors.
posted by clew at 2:59 PM on April 21, 2022 [3 favorites]


The data import wizard Greg_Ace mentions above is a better way of getting CSV data into a spreadsheet.

I have to open CSV files at work so often that I open them in the Visual Studio Code text editor with the Rainbow CSV plugin. It gives each comma separated field a color and justifies them so the fields are all lined up, without modifying the file itself. I never have to manipulate this text, just view it, but there may very well be utilities available for that as well. Some other text editors have similar plugins, and for my purposes they are far, far quicker and easier to use than Excel.
posted by lhauser at 5:36 PM on April 21, 2022 [3 favorites]


Microsoft is unlikely to ever change this behavior because they have a fetish for backwards compatibility. The tool purposefully miscalculates leap years because Lotus 1-2-3 did so 40ish years ago. As others mention, the tool gives you options to deal with this but it is more steps.

Don’t get me started on what German Excel does to CSV files.
posted by mmascolino at 8:33 PM on April 21, 2022 [2 favorites]


What flabdablet said.

Haven't used excel in yonks, OpenOfficeCalc works fine.
posted by james33 at 5:26 AM on April 22, 2022 [2 favorites]


At the time Excel implemented it, CSV was an unspecified, ad hoc format. There wasn't yet an RFC 4180 or similar for users to expect it to abide by, and other systems that used something like CSV were wildly different and incompatible.

In this environment, Microsoft were less interested in the complicated, annoying particulars of making their CSV import/export interop with competing spreadsheet apps than they were with being able to round-trip typed data, in order to use CSV as a storage format for purely-Excel spreadsheets.

(And if you accidentally parsed something as a date that would probably be largely OK, because it would be presented in the same format as it started out in. Because your company's machines all had a single locale and didn't talk to systems in other countries. Internet? What's that?)

It would be nice for Excel to have changed this behaviour some time in the last few decades, but Office has... substantial inertia. Changing it probably would confuse the workflow of someone who's working within Excel with CSV as backing format; and those of us who care already have deeply-ingrained workaround workflows.

And consequently the horror sustains indefinitely
posted by BobInce at 8:26 AM on April 22, 2022 [2 favorites]


The Tragedy of the Commas
posted by Greg_Ace at 10:06 AM on April 22, 2022 [5 favorites]


Yeah, the inertia...plus everyone trying to design a new format means this is going nowhere. The beauty of csv was it was both simple for a human to read/write as well as easy for a computer to generate/consume. Invariably, any new format is going to be way more complicated and thus will struggle to be adopted by humans or computers.

I had a thought that could solve this. If Microsoft would adopt an additional format called say "Raw CSV" which by convention used the extension *.rcsv, they could easily program Excel to just split the data into columns based on the delimiter and do no guessing/formatting/manipulating of data. It would be a struggle to get Excel to make this change but if they did support it, it would be trivial for people to create rcsv files instead of csv.
posted by mmascolino at 8:48 AM on April 24, 2022


I don't think new file types are necessary, nor even a good idea.

If you open a blank Excel spreadsheet, then click Data > From file > From text/CSV and browse to the CSV file, a wizard opens that allows you to set delimiters, field formatting, etc. as you please before displaying the data in the spreadsheet.

Seem to me that exposing that very same wizard could easily be made Excel's standard behaviour when opening a CSV file, just as it is in OpenOffice/LibreOffice Calc.

Also seems to me that setting that wizard's initial defaults to implement the exact same conversion rules that Excel currently uses for opening a CSV file with a double-click, so that the only extra burden imposed on anybody who actually likes those rules would be one extra click on an OK button, would do nobody any harm.
posted by flabdablet at 1:56 PM on April 24, 2022 [1 favorite]


But I think Microsoft has shown to loathe changing widely used default behavior. Just think of all the books, blog posts and discussion board posts that will be wrong if you change anything.
posted by mmascolino at 4:21 AM on April 25, 2022


« Older Migrating to a new windows PC.   |   Recipe for Carob Bites Newer »
This thread is closed to new comments.