MS-SQL problem: exporting with long character columns
January 26, 2024 10:55 AM   Subscribe

I have a large MS-SQL database that I want to query and write the output to csv (or some other nice exchange format). One of the columns is text, and can be quite long (5000 characters). This is a real pain.

Because the data can contain various delimiters and newlines, it is important to use an actual csv export with proper quoting and not the default thing that MS-SQL does (insert commas, pray for the best). Sometimes I can fish around for a character that isn't used to use as a delimiter, but I'd rather not have to. I have always lived in linux and the windows db stuff is foreign to me. Yes, I am absolutely stuck with MS-SQL on windows.

I've gone through two sets of solutions:
- bcp does not seem to generate reliably compliant files with escaping where needed
- Invoke-Sqlcmd piped into Export-Csv seems to truncate lines at 4000-ish characters. I think this is a limit of the available version of power shell.

The datasets are too large to fit into memory. I am happy to use a python library as long as it doesn't realize the entire query. I have previously used SQLalchemy with some hacky and not terribly reliable chunking of the data and serial writing (ie other processes can touch the database between the chunked query). I think that I am going to go back to this (chunking the data by insertion-time month) even though the performance on the unkeyed db is pretty bad.
posted by a robot made out of meat to Computers & Internet (6 answers total)
 
I am more of a DB2 or MySQL kind of guy, but if it were me with one of those other DBMSs (not your case) I would convert to hex and then unconvert later in a language of your choice. I tried a quick Google and there is this Stack Overflow that seems to be along those lines. Just FWIW.
posted by forthright at 11:46 AM on January 26


It's been (indistinct mumbling) years since I had to do this, and Microsoft has a distressing tendency to move features around between similarly named tools, but are you sure you're using the right export tool for the job? Specifically, assuming you're using SQL Server 2022, it looks like you should be using the SQL Server Import and Export Wizard (if you don't yet have it, that page tells you how to get it). You'd select the output type flat file, and then the wizard should give you multiple options of how to format the flat file. Looks like you need to choose delimited, choose your delimiter from the given options, and then set the "text qualifier" option to wrap your text field in something (probably double quotes, but you'll need to test it with your actual data).

I know it's possible to make SSIS export an actual CSV with the proper quoting, but the fact it isn't the default was infuriating. I also remember making a habit of doing two separate exports, one with all the fields except the TEXT fields, and one that was just the primary key and the TEXT field. That way if the tools failed to format the data correctly, the file I had to clean up didn't have any other data I needed to worry about preserving as I poked at the file to figure out what happened and how I could export it again (or, better, automate my workflow) so that sort of problem wouldn't happen again. I think I made all the clients and vendors with whom we shared data standardize on pipe-delimited instead of comma-delimited files because in our case there were never pipes in the data, but again, you'll need to test that with your actual data.

Also, if the data in your TEXT fields is machine-generated (e.g. JSON or joined key/value pairs), it may be worth looking at extracting it using the same tools that put it there in the first place, but that's (again) going to depend on your specific data.
posted by fedward at 11:48 AM on January 26 [2 favorites]


Everything I can find says the max line length for both MSSQL export and PowerShell cmd.exe is 8K (8191 bytes actually), are you using UTF or similar double-byte character set? It would track.

So to this end, it appears VisualBasic Script is an answer. Luckily, MS only EOL'ed VBS in October, so you're on reprieve for now, but if this is going to be a recurring task you should look into a replacement like Python or something, probably DBI oriented.
posted by rhizome at 12:22 PM on January 26


Mod note: Comment removed. Please see the Content Policy about LLM-generated answers
posted by jessamyn (staff) at 11:38 AM on January 27


Response by poster: I'll ask the box admin about the wizard. The export drop down (sure looks like a wizard) was not giving me properly formatted exports, and I believe was the cause of me pulling my hair out every time I asked for data out of this legacy system. There was some complication that I was forced to either pick proper formatting or some other requirement, but not both. I think it was making it very hard to automate.

I am in a similar situation that there is only 1 field that is really a problem, so the trick of exporting key + bad field and post-processing it is a good one. Right now I'm using bcp and striping out errant delimiters and newlines in the query itself. I just feel like that's squishing special cases in a way that I will never win.

Since you say that, there are multibyte characters in the data. I think that I never care about them; I wonder if I can collate to a smaller code page and also be rid of the problem. After all, I'll never need more than 8K characters! I am not sure if the hex trick gets around the export length. I have never used VBScript and will eat a huge pile of python frameworks if I get to avoid it.

I didn't see the LLM answer, but I did ask Bard and chatgpt to see if they had answers.
posted by a robot made out of meat at 6:48 PM on January 28


For your exchange format, can you use NDJSON? I've found that to be useful for payloads with similar challenges.
posted by NailsTheCat at 2:04 PM on January 29


« Older Macbook Pro alternative to Apple's Music app?...   |   Is it covid safe to visit my grandma after minor... Newer »

You are not logged in, either login or create an account to post comments