Reading a CSV file in R, problems with escaped quotes?
October 29, 2014 11:19 AM

I'm beginner-level with R, trying to work with a CSV file. I've noticed some rows are being interpreted incorrectly and would like help figuring out a way around that. I think the problem is escaped quotes not being parsed the way I expect.

This reads in successfully,

id,"timestamp","text",user_id
1,"2014-10-21 00:00:00","Perfectly fine text",10
2,"2014-10-21 00:00:00","Text with two \"quotes\" in it",20
3,"2014-10-21 00:00:00","Perfectly fine text",30


At least it parses the columns correctly, as shown by
read.csv("example.csv", allowEscapes = FALSE)$user_id
which correctly reads in the user_id column that follows the text column. One odd thing, though: the text displays as Text with two \\quotes\\ in it instead of the Text with two "quotes" in it you might expect. And if I set allowEscapes to TRUE, it displays as Text with two \\quotes in it, which I don't quite get.
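
Here's a self-contained way to reproduce what I'm seeing, for anyone who wants to try, by feeding the sample straight to read.csv through its text argument (the comments are the behaviour I described above, not something I'd claim holds on every R version):

sample <- 'id,"timestamp","text",user_id
1,"2014-10-21 00:00:00","Text with two \\"quotes\\" in it",20'
d <- read.csv(text = sample, allowEscapes = FALSE)
d$user_id   # 20 -- the column after the text field still parses
d$text      # displays with the doubled backslashes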

The bigger problem is something like this:

id,"timestamp","text",user_id
1,"2014-10-21 00:00:00","Perfectly fine text",10
2,"2014-10-21 00:00:00","Text with one \"quote in it",20
3,"2014-10-21 00:00:00","Perfectly fine text",30


This somehow leads to row 2 gobbling up row 3 into its text value, and then its user_id is NA. My guess at the mechanism: the " right after the backslash closes the text field early, and the " at the end of that line opens a new quoted region that runs across the newline into row 3.
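
And the same failure reproduced inline, with the output I get:

bad <- 'id,"timestamp","text",user_id
1,"2014-10-21 00:00:00","Perfectly fine text",10
2,"2014-10-21 00:00:00","Text with one \\"quote in it",20
3,"2014-10-21 00:00:00","Perfectly fine text",30'
d <- read.csv(text = bad, allowEscapes = FALSE)
nrow(d)     # 2 instead of 3 -- row 3 got swallowed into row 2's text
d$user_id   # 10 NA -- row 2's user_id is NA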

Is there something important I'm overlooking that will explain how to correctly read a CSV file in this format?
posted by RobotHero to Computers & Internet (9 answers total)
 
Can you say a bit more about why you want to read large amounts of text into R? My general rule is to do all my data cleaning in Stata (or even Excel), and then only read it into R once things look nice and tidy. R doesn't deal so well with odd characters, etc., so in general I try to avoid it whenever possible.
posted by rainbowbrite at 11:37 AM on October 29, 2014


I've never tried doing this in R, but the first answer on this Stack Exchange post suggests that read.csv() doesn't work well with escaped quotes.
posted by alex1965 at 12:07 PM on October 29, 2014


Response by poster: I'm not familiar with Stata. I don't have Excel, but I do have OpenOffice, and trying to open the file there tells me I've exceeded the maximum number of rows.

I had assumed since it was a csv file that a function called read.csv would be able to read it.

I've now come across this, which gives me the impression read.csv would rather have embedded quotes written as "" instead of \", but the proposed solutions sound kind of awkward. I tried just a search and replace in a text editor. (First \\ to _, in case any string ended with an escaped backslash, which would leave a \\" right before the closing quote; then \" to "". For my purposes, it doesn't matter if the punctuation in the text changes.) So far this seems promising.
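
For anyone doing the same thing in R rather than a text editor, the two replacements would look something like this (just a sketch of what I did; fixed = TRUE makes gsub match literal strings, so the only escaping is R's own string escapes):

lines <- readLines("example.csv")
step1 <- gsub('\\\\', '_', lines, fixed = TRUE)    # \\ -> _
step2 <- gsub('\\"', '""', step1, fixed = TRUE)    # \" -> ""
d <- read.csv(text = step2)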
posted by RobotHero at 12:11 PM on October 29, 2014


This post provides another option for getting the file into R so you can work with it there.

If you're still playing around with it in a text editor, make sure you are using the latest version of Calc if you have fewer than a million rows. Otherwise it looks like Base might be useful (same link).
posted by ropeladder at 12:56 PM on October 29, 2014


Best answer: R is usually very good at handling .csv files in my experience, so I am surprised it's hiccuped here. Anyway, there is a tool out there called OpenRefine (formerly Google Refine) which is good for data cleaning and manipulation, especially if you are at the too-big-for-Excel level.
posted by PercussivePaul at 1:24 PM on October 29, 2014


Best answer: Try using readChar() to pull the whole file in as one string, gsub() to strip the backslashes, and then read.csv on the result.

f <- "c:/users/foo/foo.csv"
tmp <- readChar(f, file.info(f)$size)
tmp1 <- gsub("\\\\+", "", tmp, perl = TRUE)  # runs of one or more backslashes -> nothing

writeLines(tmp1, "newfile.csv")
read.csv("newfile.csv")

I haven't tested the regex, but that should replace all backslashes with nothing.
posted by zug at 1:38 PM on October 29, 2014


Ooh I fought with a very similar problem for hours earlier this term. Here's what I did to get around it:
data <- read.table(fileName, sep = ",", quote = "\\\\\"", header = TRUE)
Instead of reading it in as a .csv, read it in as a generic table and specify that the separator is a comma. Then you can specify which characters get treated as quotes (quote takes a set of quoting characters; that string decodes so that both the backslash and the double-quote count as quotes) - I *think* what I have there should work, but you might have to play with the escape sequences if that doesn't do it. The escape sequences were very unintuitive for me, but I finally got that one to work for double-quotes in text strings.
posted by dialetheia at 3:23 PM on October 29, 2014


Best answer: It might or might not be useful for your case, but csvkit has various utilities for working with CSV data, including a utility for cleaning up some common CSV file problems.
posted by StrawberryPie at 4:09 PM on October 30, 2014


Response by poster: After some sanity tests (for example, checking that every entry in every numeric column really is numeric), I'm convinced the replace in Notepad++ was enough for my purposes this one time. But I also marked as best answers the ones with promising data-cleaning tools that could prove useful in the future.
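
The sanity tests were along these lines (a sketch; cleaned.csv stands in for whatever file the Notepad++ replace produced, and the column names are the ones from my sample above):

d <- read.csv("cleaned.csv")
stopifnot(is.numeric(d$id), is.numeric(d$user_id))  # numeric columns actually parsed as numbers
stopifnot(!any(is.na(d$user_id)))                   # no rows got swallowed into a text field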
posted by RobotHero at 4:30 PM on October 30, 2014


This thread is closed to new comments.