Convert rich text field in Access or Excel into plain text
November 12, 2013 11:25 AM   Subscribe

I have thousands of rich text message message fields currently formatted in RTF (rich text format) that need to be stripped of their formatting and converted to plain text. Currently the field is in a Microsoft Access database. But if needed, the database could be sent to Excel if that would ease the conversion of the RTF to plain text.

Here is an example of the text:

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 Voice Mail Upgrade
\par
\par The wireless voice mail system in your area will soon be upgraded! As a result, ALLTEL will be able to offer you enhanced voice mail services in the future. Please note important dates and follow the instructions below to avoid any possible disruptions to your wireless voice mail service.
\par
\par Set Up Your Voice Mailbox in the New System
\par Between February 22 and March 4, please set up your greeting and security code on the new wireless voice mail system. Dial 1-800-834-9912, enter your 10-digit wireless number and interrupt the greeting by pressing the pound (#) key. Enter "9999" when you are prompted for a security code. A tutorial will step you through the set up of your mailbox. Please note that you will continue to use the current voice mail system, password and keystrokes until the new system is activated on March 6.
\par
\par Begin Using the New System on March 6
\par March 5 will be the last day of operation for the current voice

And quite simply, I want to remove all of the RTF formatting tags and leave only the text.

Thank you for your assistance.
posted by Jackie_Treehorn to Computers & Internet (16 answers total)
 
Okay, here's whatcha do:

- Send that data to Excel.
- Do a Ctrl+H, enter "\par " (sans quotes in the "Find" box and NOTHING in the Replace box. Do a Replace All.
- Insert a NEW column prior to the first one in the sheet. Enter in the number of each row (1 in row 1, 2 in row 2, alllll the way down - let me know if you need help with automating that).
- Do a Ctrl+A to highlight EVERYTHING.
- Sort your data by Column B (the column that contains the RTF tags)
- Delete every row that IS an RTF tag (they should sort out to the top).
- REsort your data by Column A (which is the "original" sort order).
posted by julthumbscrew at 11:30 AM on November 12, 2013


If the rtf lines always start with a "{" and end with a "}", you can find and replace them all with blanks by doing Ctrl+H, and putting "{*}" (again without the quotes) into the find box and nothing in the replace box. This will end you up with a bunch of blank lines, though, which you may not want -- julthumbscrew's suggestion of making a sorting column and then messing with things and re-sorting by the sorting column is a great trick.
posted by brainmouse at 11:42 AM on November 12, 2013 [1 favorite]


I'd drop it into a .txt file, then drop that into an Excel. I strip eeeeeverything in .txt.
posted by mochapickle at 11:45 AM on November 12, 2013


Response by poster: Thanks for the reply. I can see that you method removes the "\par" from the RTF fields, but that doesn't strip out the other RTF data such as

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
\viewkind4\uc1\pard\lang1033\f0\fs17

from my example. That is, unless I'm missing something.
posted by Jackie_Treehorn at 11:46 AM on November 12, 2013


julthumbscrew described how to do that in everything after step 2 of her answer, starting with creating a new column to sort by at the end -- what I called the "sorting column" -- this is a column numbered 1-1000 (or however many rows you have) that you can use to sort by at the end. The easiest way to do that is by inserting a new column A, then putting a 1 in A1, a 2 in A2, then selecting both of those cells, then either drag down from the bottom right corner of that selection, or double click on the bottom right corner of that selection to fill down)

Then you can sort by your data column, and all the rows that start with {\ or \ or { or whatever will group together, and you can delete them. Then, you can re-sort by that sorting column you made before, and poof everything will be back in the right order.
posted by brainmouse at 11:54 AM on November 12, 2013


Do you want the gist of the text, or the full and complete text?

If these messages are all generated on fairly similar PCs in North America (excluding Quebec), the simple approach mentioned above will do a fairly decent job. If the messages come from different systems worldwide, your definition of plain text just got a bunch weirder.

The text encoding is defined in the header, and non-ASCII characters are represented by \'XX or \uXXXX (where XX and XXXX are hex values for 8-bit and Unicode characters). I forget how tables work; it's been more than a decade since I managed a publishing project that took RTF fragments from a database and converted them to an even more horrible format (Quark XPress Tags).

If these are memo fields in an Access database, there may be a Text() method that automatically converts them to plain text.
posted by scruss at 12:44 PM on November 12, 2013


I'm just looking at access 2007 and there is a feature (under the "fields" menu-memo settings) to convert the contents of a memo field from rich text to plain text so you may want to look at that
posted by canoehead at 2:41 PM on November 12, 2013


The other problem is that the sample you've provided, if it's accurate, has unbalanced parenthesis, which would make this a bit harder to solve with simple search and replace. RTF seems to be quite hard to strip reliably (see this discussion here).

If I've understood correctly, the RTF markup is being stored in a plain text field in your Access document. What I mean is that Access doesn't realise that this is rich text, and just thinks that someone laboriously typed "{", "\", "r", "t", "f", etc. Is this correct?

If I'm wrong, and Access knows this is rich text and displays it in this way, and you can actually get it into Excel with the formatting displaying correctly, formula assignment discards formatting implicitly - if you have italicised text in cell A1, "=A1" will just bring across the contents without the italics. If I'm right, yes, you have a parsing problem on your hands.
posted by curious.jp at 3:57 PM on November 12, 2013


Response by poster: @ curious.jp: I'm not experience at all with Access, but I'll try to answer your question. I'm viewing a table in Access with the information that I pasted above in one of the 'cells'. Access appears not realize the text is RTF. It shows the text just as pasted in the original question with no formatting being done at all.
posted by Jackie_Treehorn at 7:10 AM on November 13, 2013


Response by poster: @Canohead: I did see that menu setting in Access and tried changing from rich text to plain text. It didn't appear to change anything in that column.
posted by Jackie_Treehorn at 7:12 AM on November 13, 2013


Response by poster: Another thought. Can Excel have RTF cells? Could I just change that particular column in Excel to RTF and it would show the formatting correctly?

I could live with the RTF if it was displayed properly in Excel.
posted by Jackie_Treehorn at 7:22 AM on November 13, 2013


> It didn't appear to change anything in that column.

That's right, but when you export to .txt there are no formatting tags
posted by canoehead at 7:54 AM on November 13, 2013


Response by poster: @canoehead: Interesting, but I'm unable to export to a .txt document that doesn't have the formatting tags that are in the Access column.

Perhaps you could tell me the export settings used to remove the formatting tags?
posted by Jackie_Treehorn at 8:23 AM on November 13, 2013


When the export wizard starts, don't select "export data with formatting.." Then I chose "fixed width".

Just to clarify one thing , when you browse the table in access, can you see the formatting tags?
posted by canoehead at 8:39 AM on November 13, 2013


Response by poster: As I continue messing around with the exported document in Excel, I'm finding it may just be best to do multiple find and replace commands to weed out what I don't want.

The formatting tags are fairly consistent in all of the messages. With less than a dozen find and replace commands, I'm able to strip out all of the junk and leave what I want behind.

Although not totally resolved as far as a quick, automated solution, I'm making this topic resolved as I see some light at the end of the tunnel.

Thank you all for your help.
posted by Jackie_Treehorn at 8:41 AM on November 13, 2013


I'm glad you were able to solve it. A more general (but technical - you'll need to be comfortable with scripting a macro) can be found in the second part of Nossidge's comment here.
posted by curious.jp at 2:12 PM on November 13, 2013


« Older taking responsibility in a breakup   |   What are the exemptions to performing the Hajj? Newer »
This thread is closed to new comments.