Unix Search and Replace Question, of Course!
May 3, 2013 6:31 AM

Some of my data has extra carriage returns that are seriously cramping my style. I'd like to use Unix commands to fix the data so it can be loaded into tables. I have used sed, but it's not working. Details inside:

I'm getting the following data (an example, not my actual data): 4 columns, unfortunately comma-delimited and with every field surrounded by quotes. The data is also originally DOS/Windows, but that shouldn't impact processing.

As you can see, in the second record there's a carriage return in the middle, which is pushing part of the line to the next line:

"USA","WV","Charleston","48"
"USA","IA","Des Moines
","44"

"USA","WV","Charleston","48"

Expected results:

"USA","WV","Charleston","48"
"USA","IA","Des Moines","44"
"USA","WV","Charleston","48"



I'm trying to search for instances of '\r\n",' and replace them with '",', but that doesn't seem to work. Help!
posted by sandmanwv to Technology (12 answers total) 1 user marked this as a favorite
 
Are you doing this on a Unix box?

I usually open the file in vi/vim in binary mode then type this:

:1,$s/^M\",/\",/g

To create the ^M, hit Ctrl+V and then hit the Return key. The backslash in front of each double quote just makes sure it is treated as a literal character.

To read that line in English: from line 1 to the end of the file ($), replace the string '^M",' with the string '",' globally (multiple times per line if necessary).
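
If it needs to run unattended, the same substitution can be scripted; here's a sketch, assuming Vim is installed and that the stray breaks really are bare carriage returns (so the ^M and the ", land on the same line):

vim -b -es -c '1,$s/\r",/",/ge' -c 'wq' file.csv

In a Vim pattern, \r matches a literal carriage return, so no Ctrl+V gymnastics are needed in a script.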
posted by jillithd at 6:37 AM on May 3, 2013


Response by poster: I forgot to mention, this needs to run in a script for automated processing, so interactive text editors are out (I'm able to do this manually in Notepad++). Thanks though!
posted by sandmanwv at 6:40 AM on May 3, 2013


Use dos2unix to strip the carriage returns?
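
For example (note dos2unix rewrites the file in place, so keep a copy):

dos2unix data.csv

Bear in mind this only converts \r\n to \n, so the embedded break would survive as a bare newline and the record would still be split across two lines.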
posted by namewithoutwords at 6:47 AM on May 3, 2013


Are you seeing the ^M character in unix indicating that you have a \r\n? If not, I'd suggest:

cat file | sed ':a;N;$!ba;s/\n//g' | sed 's/\"\"/\"\n\"/g'

viz:

bfranklin@box:~/test$ cat test.txt
"USA","WV","Charleston","48"
"USA","IA","Des Moines
","44"
"USA","WV","Wheeling","48"
bfranklin@box:~/test$ cat test.txt | sed ':a;N;$!ba;s/\n//g' | sed 's/\"\"/\"\n\"/g'
"USA","WV","Charleston","48"
"USA","IA","Des Moines","44"
"USA","WV","Wheeling","48"

posted by bfranklin at 6:49 AM on May 3, 2013


You appear to be trying to naïvely parse CSV. It's perfectly okay for CSV to contain newlines in fields; how else would you get multi-line cells? Write a filter in your favourite scripting language (in my case, Perl and Text::CSV) to escape these data properly. Use someone else's frustration to solve your edge cases.
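
A rough sketch of such a filter with Text::CSV (not anyone's actual production code; assumes the module is installed from CPAN, and quoting behaviour adjusted to taste):

perl -MText::CSV -e '
    my $csv = Text::CSV->new({ binary => 1, always_quote => 1, eol => "\n" });
    while (my $row = $csv->getline(*STDIN)) {   # parses quoted fields, doubled quotes, embedded newlines
        s/[\r\n]+//g for @$row;                 # strip stray line breaks inside fields
        $csv->print(*STDOUT, $row);             # re-emit one clean record per line
    }
' < in.csv > out.csv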
posted by scruss at 7:07 AM on May 3, 2013 [5 favorites]


Within the directory you want to search and change (.csv files); note this edits the files in place with GNU sed:

find . -name "*.csv" -type f -exec sed -i ':a;N;$!ba;s/\n//g;s/""/"\n"/g' {} \;
posted by LeanGreen at 7:07 AM on May 3, 2013


I have used sed, but it's not working ... I'm trying to search for instances of '\r\n",' and replace them with '",', but that doesn't seem to work.

The reason that doesn't work is that sed does all its processing one line at a time, and is never seeing \r\n" because the \n ends the input line; it's seeing a line that ends with Des Moines\r followed by another that starts with ".
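
You can see this with a quick test (GNU sed; printf expands the \r\n escapes):

printf '"USA","IA","Des Moines\r\n","44"\r\n' | sed 's/\r\n",/",/'

The output is identical to the input, because the \n the pattern needs never appears in sed's pattern space.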

Writing bfranklin's script :a;N;$!ba;s/\n//g out with one sed command per line makes it look like this:

:a
N
$!ba
s/\n//g

:a defines a label and doesn't process the input line in any way. N reads the next input line, and appends a newline plus that next line to the existing buffer contents. $!ba says "on every line that isn't the last, branch to label a" - so the first three commands will run over and over until the last line's been read, which means that the whole input file, complete with all newlines (except the one that terminates the very last line) ends up in the buffer.

The last command does a global substitution of newlines for nothing, wiping out every newline in the buffer. So now the only way to tell where the original line boundaries are is by virtue of the fact that the quote that ends each line will be immediately followed by the one that starts the next line. The second sed command with the script s/\"\"/\"\n\"/g does a global insertion of a newline between every pair of adjacent quotes, which given your sample data does indeed restore the line boundaries to where you want them.

I think there's room for improvement there.

First up, the script doesn't allow for the fact that the original file came from DOS and therefore has \r\n line endings instead of just \n. That's easily fixed by changing the s/\n//g part to s/\r\n//g.

I also can't see any reason why s/\"\"/\"\n\"/g should need its own instance of sed; I would have added it to the end of the script for the first command.
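
That would look like this (a sketch, assuming GNU sed):

sed ':a;N;$!ba;s/\r\n//g;s/""/"\n"/g' in.csv >out.csv

with the caveat that a stray \r can survive on the very last line, since its terminating \n never makes it into the buffer.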

But there's a bigger problem with the approach, which is that sooner or later you're going to strike an input file where one of the quoted fields contains a quote. The CSV way of dealing with this is to double the embedded quote: for example, a CSV field containing the text 12" album would look like "12"" album". Run that through bfranklin's script and it will insert a newline into the middle of that field, right where you don't want it.
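
A quick demonstration of that breakage (GNU sed):

printf '"US","12"" album","1"\n"US","B side","2"\n' | sed ':a;N;$!ba;s/\n//g' | sed 's/""/"\n"/g'

"US","12"
" album","1"
"US","B side","2"

The doubled quote inside 12" album has been turned into a bogus record boundary.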

A better way to deal with CSV data containing embedded newlines is to try ignoring them first; almost every tool that can import CSV data to tables already handles them properly, and you'll simply end up with Des Moines\r\n in one of your table fields.

If you really are working with a horribly broken CSV importer or you just want to clean out those newlines for neatness, your sanest path is obviously to use a standard CSV library as suggested by scruss.

But if you really, really want to roll your own, and you're determined to do it with sed, here's one way that should work reasonably well. It's based on the following assumptions about the CSV files you'll be feeding it:

1. The input file is composed of records, and the records are composed of fields.
2. Fields are either quoted or unquoted. If unquoted, they cannot contain quotes, commas or newlines. If quoted, they can contain anything at all but contained quotes must be doubled.
3. Commas followed by optional whitespace, occurring outside fields, separate fields within a record.
4. Newlines occurring outside fields separate records.

And given all that, the job of the script we want to write is to

a. Convert all \r\n sequences to \n to convert DOS newlines to Unix ones.
b. Remove all newlines that occur inside fields.

The first part is easy: we just look for \r at the end of each input line and kill it. (The empty pattern in the s/// reuses the most recently used regular expression, which here is the /\r$/ address, so there's no need to write the pattern twice.)

/\r$/s///

To get the second part right, I want to start by reading a whole input record into sed's buffer. So I need to detect the case where what's in the buffer is not a complete record because it's been terminated early by a newline embedded inside a field.

Fortunately, the only thing I need to look at to do that is the quotes: if there's an odd number of quotes in the buffer, the record is incomplete and I need to read more input lines to complete it (see rule 2). I don't actually care what separates the fields.

So I want a regular expression that matches a buffer containing an odd number of quotes. Probably easier just to negate one that matches an even number of them, because an even number is just any number of pairs. I don't care what, if anything, is between the paired quotes. So this should do it:

/^\([^"]*"[^"]*"\)*[^"]*$/

Translation: from the start of the buffer, match groups of (any number of things that aren't quotes, followed by a quote, followed by any number of things that aren't quotes, followed by a quote) any number of times, followed by any number of things that aren't quotes to the end of the buffer - where "any number" could be zero.
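
You can check that test against the sample data (GNU sed; -n with !p prints only the lines that fail the even-quote match):

printf '%s\n' '"USA","WV","Charleston","48"' '"USA","IA","Des Moines' | sed -n '/^\([^"]*"[^"]*"\)*[^"]*$/!p'

"USA","IA","Des Moines

Only the odd-quoted fragment is flagged as incomplete.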

Whenever I've just read a line and the buffer doesn't look like that, I need to append another input line, get rid of the newline that sed will insert before it, and then do all that again on the resulting (possibly still incomplete) record. So here's the script:

:getrecord
/\r$/s///
/^\([^"]*"[^"]*"\)*[^"]*$/!{
N
s/\n//
b getrecord
}

which becomes the following shell script one-liner:

sed ':1;/\r$/s///;/^\([^"]*"[^"]*"\)*[^"]*$/!{;N;s/\n//;b1;}' in.csv >out.csv

and I think that will do what you want without screwing up any edge cases.
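
As a quick check against the sample data (GNU sed):

printf '"USA","WV","Charleston","48"\r\n"USA","IA","Des Moines\r\n","44"\r\n' | sed ':1;/\r$/s///;/^\([^"]*"[^"]*"\)*[^"]*$/!{;N;s/\n//;b1;}'

"USA","WV","Charleston","48"
"USA","IA","Des Moines","44"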
posted by flabdablet at 11:55 AM on May 3, 2013 [2 favorites]


There's a variant of the CSV format where rule 2 looks like

2. Fields are either quoted or unquoted. If unquoted, they cannot begin with a quote, and cannot contain commas or newlines. If quoted, they can contain anything at all but contained quotes must be doubled.

If that's what the source of your CSV files used, there won't always be an even number of quotes in a complete CSV record and the record matching regexp will need to parse the fields more carefully.

There are two valid record formats:

R0: no fields
Rn: one field followed by any number of (comma, optional whitespace, another field) groups

where any field can be either quoted or unquoted:

QF: at least one (quote followed by any number of nonquotes followed by a quote) group
UF: empty, or a nonquote followed by any number of things that are neither commas nor newlines

Extended regex syntax (no \ before grouping parentheses, at-least-one operator available) will be more convenient for this.

QF: ("[^"]*")+
UF: ([^"][^,\n]*)?

so a field will be

(("[^"]*")+|([^"][^,\n]*)?)

and a record will be

^((("[^"]*")+|([^"][^,\n]*)?)(,[ \t]*(("[^"]*")+|([^"][^,\n]*)?))*)?$

and the shell script one-liner becomes

sed -r ':1;/\r$/s///;/^((("[^"]*")+|([^"][^,\n]*)?)(,[ \t]*(("[^"]*")+|([^"][^,\n]*)?))*)?$/!{;N;s/\n//;b1;}' in.csv >out.csv

and if you can look at that and still not want to use a standard library instead then I think there's something deeply wrong with you.
posted by flabdablet at 12:45 PM on May 3, 2013


Actually that can be shortened a tiny bit by considering empty fields to be degenerate quoted fields rather than degenerate unquoted ones, which gives

QF: ("[^"]*")*
UF: [^"][^,\n]*

so a field will be

(QF|UF) which expands to (("[^"]*")*|[^"][^,\n]*)

and a record with at least one field will be

^field(,[ \t]*field)*$

and since that first field can be empty and the rest can be missing, there's no need to make specific provision for an empty record. So the record regexp becomes

^(("[^"]*")*|[^"][^,\n]*)(,[ \t]*(("[^"]*")*|[^"][^,\n]*))*$

which shortens (!) the script to

sed -r ':1;/\r$/s///;/^(("[^"]*")*|[^"][^,\n]*)(,[ \t]*(("[^"]*")*|[^"][^,\n]*))*$/!{;N;s/\n//;b1;}' in.csv >out.csv

which is still not something you'd want to encounter in somebody else's code before you'd had your coffee.
posted by flabdablet at 12:57 PM on May 3, 2013


And there's a bug in that, because it ends up being able to match whitespace following a comma as part of an unquoted field, which lets

"USA", "IA", "Des Moines

match as a complete record when it shouldn't. I won't go through all the steps again, but here's a version that deals with that:

sed -r ':1;/\r$/s///;/^(("[^"]*")*|[^"][^,\n]*)(,[ \t]*(("[^"]*")*|[^ \t"][^,\n]*))*$/!{;N;s/\n//;b1;}' in.csv >out.csv
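
Checking it against the case that broke the previous version (GNU sed):

printf '"USA", "IA", "Des Moines\r\n","44"\r\n' | sed -r ':1;/\r$/s///;/^(("[^"]*")*|[^"][^,\n]*)(,[ \t]*(("[^"]*")*|[^ \t"][^,\n]*))*$/!{;N;s/\n//;b1;}'

"USA", "IA", "Des Moines","44"

so the split record now gets joined instead of slipping through.
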
Are you ready to take scruss's advice yet?
posted by flabdablet at 1:06 PM on May 3, 2013


You do not want to do this yourself. Take scruss's advice.
posted by dmd at 1:21 PM on May 3, 2013


If Python is a language you’re comfortable with, use the built-in csv module in the REPL. It’s like using the Unix command line, but it also follows scruss’s advice.

(edit, here’s mine)
>>> from csv import reader, writer
>>> with open('/tmp/in.txt') as input:
...     rows = filter(None, reader(input))
...     rows = [map(lambda v: v.replace('\n', ' ').strip(), row) for row in rows]
...     with open('/tmp/out.txt', 'w') as output:
...             outcsv = writer(output)
...             for row in rows:
...                     outcsv.writerow(row)
... 
The “filter” deals with a blank line at the end.
posted by migurski at 9:40 AM on May 11, 2013

