how to strip out text from a CSV file record
November 12, 2013 9:25 AM   Subscribe

I have an exported CSV file which contains a field with an entire address in it like so: name, address, postal code, country. I need to strip out everything except the country name so I can sort by country.

I'm working in Libre Office and do not have Excel available to use. I'm thinking there is a quick hack I can do to strip text.

Logically, I need to look from the right of a line of text to find the first space and delete everything to the left of this.

Before: name, address, postal code, country
After: country

I'm thinking a word processor could do this, then I could paste the result back into the spreadsheet.

Thanks for any tips on how to do this.
posted by diode to Technology (13 answers total)
 
" I need to strip out everything except the country name so I can sort by country."

I don't have LibreOffice in front of me, but I'm pretty sure it can sort by any column you want without a problem. Alternately, if you really do want to delete those columns, why not just open up the CSV file, block-select the first four columns, and delete them?
posted by Tomorrowful at 9:34 AM on November 12, 2013 [1 favorite]


First, if everything is comma separated, save the file as a CSV instead of whatever format it is in, then open that CSV up in LO Calc, and it should separate the commas out into separate cells. then, clear the cells until you get the rightmost from each.

Tomorrowful, I think all the text is in one cell, so that doesn't work.
posted by deezil at 9:36 AM on November 12, 2013


Any chance you're on Linux (or on a Mac and willing to use the command line)? If so, there are a few programs to capable of doing this with the right commands. I think sed, awk, and cut could each do the job. I'm on a Windows machine right now, so I can't write and test a command to give you, but if this isn't answered when I get home from work, I might follow up.
posted by paper chromatographologist at 9:38 AM on November 12, 2013


I'm not totally sure I understand. You have a file that has multiple fields and ONE of them is this address thing, or that address block with the commas is the only thing on each line? If the latter, you should be able to open the CSV file in Calc (the Excel of Libre Office) and it will (possibly with some prompting from you) put all the values into separate fields. Then you can sort.

So you'd have

jessamyn, box 345 blablabla, 05060, usa

and opening it would give you

jessamyn | box 345 blablabla | 05060 | usa

If there is other cruft in there, the Word way to do this is select the text and choose "text to table" which I guess is a thing that Libre Office can also do.
posted by jessamyn at 9:38 AM on November 12, 2013


Response by poster: You have a file that has multiple fields and ONE of them is this address thing.

Yes, that's it. I have a CSV file that looks like this:
date, number, "address text block that includes commas", cost, weight

The text block field is the address I'm attempting to strip out everything except the country name, which can be two words, like Hong Kong or South Korea.
posted by diode at 9:43 AM on November 12, 2013


My suggestion: open (a copy) of the file in your spreadsheet editor, delete all columns except for the address section. Save the file. Open the file in a text editor and find-replace all quotes with nothing (effectively deleting all quotes). Open file in spreadsheet editor as csv file. Copy the country column and paste it back into original copy as a new column. Sort based on country.
posted by Green With You at 9:53 AM on November 12, 2013 [1 favorite]


You can do it all in LibreOffice Calc.

If the address is in cell A2, this will return everything to the right of the last comma, trimmed of any spaces:
=TRIM(RIGHT(A2,LEN(A2)-SEARCH(",[^,]*$",A2)))
You may need to enable regular expressions in formulas.
posted by scruss at 9:56 AM on November 12, 2013


Green With You 's answer is right, but I think it's missing a step. Save the file (with everything else stripped out) as a *.txt file. Open the file in Notepad and find/replace the quotes with nothing. Save it. Open in Calc as a comma-deliminated file. It'll turn it into a chart with each part as separate columns. Now, copy and paste the country fields back into your original document. They'll be in the same order.
posted by General Malaise at 10:00 AM on November 12, 2013


I don't know a thing about LibreOffice, but do you have a Text to Column feature? This is what I'd do in Excel, using the commas as delimiters. Then I'd eliminate the extraneous columns, leaving Country.
posted by Ruthless Bunny at 10:24 AM on November 12, 2013


Response by poster: Trying to separate the text block into columns does not work as the format for international addresses is all over the map. Domestic addresses perhaps this would work but for international addresses the addressing can be so variable that the end result is not good. The post by scruss seems like the way to do this, search for everything to the right of the last comma, trimmed of any spaces.
posted by diode at 10:42 AM on November 12, 2013


If you know any regex (it's fairly self-explanatory) you could do it in a minute with http://gskinner.com/RegExr/

You might find it even easier to combo with LibreOffice.
posted by turkeyphant at 10:43 AM on November 12, 2013


Response by poster: Okay, problem solved. Our IT guy put the text block into the vi editor, used macros to find the last comma, then delete the rest of the line and voila.
posted by diode at 10:59 AM on November 12, 2013


Put the text block into the vi editor, used macros to find the last comma, then delete the rest of the line and voila.

Glad your problem is solved, but be sure that you got the correct answer for all the edge cases. This problem is ... non-trivial to solve well. Read me.
posted by RedOrGreen at 3:16 PM on November 12, 2013


« Older Help me find plates that will last a decade   |   3D Printing: Can I Affordably Print Larger Than a... Newer »
This thread is closed to new comments.