Export gmail without nested replies
December 4, 2021 10:15 PM

I can export Gmail search results to a .mbox file and convert that file to text. Most of the email messages are threads with multiple "replies", which are then repeated in the text document. So what should be about 350 pages is instead 1940 pages, and I REALLY don't want to delete all the extras one at a time. Can the nested replies be left out of the export, or is there a workaround?
posted by she's not there to Computers & Internet (4 answers total) 1 user marked this as a favorite
 
The easiest thing is to skip the conversion-to-text step and read the mbox directly in an email client like Thunderbird, which has good heuristics for hiding quoted material.

The trouble is that there is no algorithm for identifying quoted material in an email chain; heuristics are as good as it gets, and no matter how good the heuristic is, it's eventually going to hide something it shouldn't have or expose something it should have hidden. Ultimately it is going to be eyeballs that make the final call, regardless of how much help the tools give you.
posted by flabdablet at 10:56 PM on December 4, 2021


Getting rid of lines starting with '>' (but not '>From') and similar hassles is only a fraction of the battle. Those email messages probably have a 'text/html' part and maybe no 'text/plain' part, or vice versa, or they have both. They may also be encoded in quoted-printable or some other weird encoding. You basically have to write a program and use some libraries to handle the myriad kinds of "text" that can make up a MIME email message.
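
For what it's worth, here is a minimal sketch of the "use some libraries" part, assuming Python and its standard-library email package. The helper name body_text and the choice to prefer 'text/plain' over 'text/html' are my own assumptions, and the HTML-to-text step is deliberately crude (it keeps every text node, including any script or style content).

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def body_text(msg):
    """Return the readable body of a parsed message (hypothetical helper).

    Prefers a text/plain part and falls back to text/html. get_content()
    undoes quoted-printable/base64 and decodes the declared charset.
    """
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    text = part.get_content()
    if part.get_content_type() == "text/html":
        parser = _TextOnly()
        parser.feed(text)
        parser.close()
        text = "".join(parser.chunks)
    return text


# Example input: a quoted-printable, UTF-8 encoded plain-text message.
RAW_QP = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9 is open\n"
)

if __name__ == "__main__":
    # policy.default makes the parser return modern EmailMessage objects.
    print(body_text(message_from_string(RAW_QP, policy=policy.default)))
```

The important bit is parsing with policy.default, which is what makes get_body() and get_content() available on the parsed message.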

Yeah, the heuristics are the thing. People tend to quote wrong: they include the whole original and add their reply either at the top or at the bottom. You could catch this pretty easily. A whole chunk of '>' or '> >' or '> > >' madness can go away. But you probably want to keep short chunks of quoted material that are interleaved with the response, like when someone answers your list of questions one by one. The heuristics should, hopefully, be able to rip out the "just quoted the entire message" things while leaving the useful quoted material alone.
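
A rough sketch of that heuristic in Python, assuming the body is already decoded plain text. The function names and the max_keep threshold of 3 lines are made up for illustration, and like any heuristic it will sometimes guess wrong.

```python
from itertools import groupby


def is_quoted(line):
    """True for '>'-quoted lines, but not mbox-escaped '>From ' lines."""
    stripped = line.lstrip()
    return stripped.startswith(">") and not stripped.startswith(">From ")


def strip_long_quotes(text, max_keep=3):
    """Drop runs of quoted lines longer than max_keep; keep short ones.

    Short runs are assumed to be interleaved context worth keeping, and
    long runs are assumed to be "quoted the entire message" filler.
    """
    kept = []
    for quoted, run in groupby(text.splitlines(), key=is_quoted):
        run = list(run)
        if quoted and len(run) > max_keep:
            continue  # a long quoted block: throw it away
        kept.extend(run)
    return "\n".join(kept)
```

Tune max_keep against a sample of the real messages before trusting it on the whole archive.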

(People forgot how to use email when MIME text/html came into fashion. It was much easier when everything was text/plain.)
posted by zengargoyle at 11:46 PM on December 4, 2021


Wouldn't a weaker heuristic that strips out only quoted text matching the *entirety* of a previously sent message be effective? Sure, you're going to have to figure out some of the HTML/text issues as above (honestly, I don't know much about that component), but that type of regex seems feasible, wouldn't likely remove useful quotes, and would still cut down on the length of the printed/archived message chains. Again, I'm not sure on some of the technical details, but I can still see a "good is better than nonexistent perfect" solution.
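
To make the idea concrete, here is a hedged sketch in Python. It assumes the message bodies are already decoded plain text, arrive in the order they were sent, and use '>' quoting. The function names are hypothetical, and attribution lines such as "On ... wrote:" are left in place.

```python
import re


def _norm(text):
    """Collapse whitespace so re-wrapped quotes still compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def _unquote(lines):
    """Strip one level of '>' quoting from each line of a block."""
    return "\n".join(re.sub(r"^>\s?", "", ln) for ln in lines)


def drop_whole_message_quotes(bodies):
    """Remove quoted blocks that reproduce an earlier body in full."""
    seen = set()      # normalized text of every body processed so far
    cleaned = []
    for body in bodies:
        out, block = [], []
        # The trailing None flushes a quoted block that ends the message.
        for line in body.splitlines() + [None]:
            if line is not None and line.startswith(">"):
                block.append(line)
                continue
            if block:
                # Keep the block unless it is an entire earlier message.
                if _norm(_unquote(block)) not in seen:
                    out.extend(block)
                block = []
            if line is not None:
                out.append(line)
        seen.add(_norm(body))
        cleaned.append("\n".join(out))
    return cleaned
```

Partial quotes never match a whole earlier message, so interleaved replies survive untouched.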
posted by DeepSeaHaggis at 1:01 AM on December 5, 2021


zengargoyle makes some good points. Modern (by which I mean roughly anything from the last two decades) email is only kinda text. Really, modern emails should be understood to be "RFC 2045 MIME documents", which just happen to be stored on disk using ASCII.

Trying to edit a large mail archive in a .mbox using text-processing tools is likely to be frustrating. First, it's relatively easy to break the mailbox file if you strip out the wrong lines, which will result in messages being silently joined (and one message apparently 'disappearing'). Second, you can also break the MIME formatting of the message and its sub-parts, which will cause display issues at the least, and information loss at worst. Third, it's likely that a lot of the "fat" you want to trim out of the archive is encapsulated in quoted-printable or base64 encodings, and can't be easily manipulated by just looking line-by-line for character strings. And then there are likely messages formatted with ugly HTML...

So, some sort of parser script that properly handles and parses the .mbox file, and then inspects each MIME message individually, slicing and dicing as needed, is probably the "right" solution.
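
As a starting point, here is a minimal sketch of such a parser script, assuming Python and its standard-library mailbox and email modules. The function names are hypothetical, and summarize() only reports each subject and body length; the actual slicing and dicing would go where the body text is read.

```python
import mailbox
from email import message_from_bytes, policy


def iter_messages(mbox_path):
    """Yield each message in the .mbox as a parsed EmailMessage."""
    # create=False: fail loudly rather than create an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            # get_bytes() returns one message without its "From " line.
            yield message_from_bytes(box.get_bytes(key), policy=policy.default)
    finally:
        box.close()


def summarize(mbox_path):
    """Return (subject, number of body lines) for every message."""
    rows = []
    for msg in iter_messages(mbox_path):
        part = msg.get_body(preferencelist=("plain", "html"))
        text = part.get_content() if part is not None else ""
        rows.append((msg["Subject"], len(text.splitlines())))
    return rows
```

The mailbox module handles the message boundaries, so nothing here edits the .mbox file by hand.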

Is the goal to actually reduce the number of pages that these messages would take to print? Or is the goal to reduce storage space on disk? If it's the latter, .mbox files compress quite well (particularly if there's a lot of duplicated content)... slapping the .mbox in a .zip or .gz will likely save ~40-50% of its storage footprint, maybe more, if the goal is long-term storage as data.
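
If storage is the concern, the saving is easy to measure on your own archive. This is a small sketch using Python's standard gzip module. The function name is made up, and the ratio you get will depend on how repetitive your mail is.

```python
import gzip
import os
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path to dst_path and return (original, compressed) sizes."""
    dst_path = dst_path or src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files are fine
    return os.path.getsize(src_path), os.path.getsize(dst_path)
```

The original file is left in place, so nothing is lost if you decide against keeping the compressed copy.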

Or is the goal to make the messages easier to read by a human? Loading the .mbox file into any half-decent email program (Apple Mail, Thunderbird, whatever) should make it a lot more pleasant to actually review. Most mail programs will show you message threads within a folder (.mbox file) and only display the unquoted portions by default. (Arguably it's this behavior that causes people to continually topquote other people's messages.)

If loading them into a MUA (mail user agent, i.e. an email program) for reading/review isn't an option, there are tools like Hypermail that will make browsable HTML versions of email archives. It won't reduce the storage footprint (in fact it'll likely increase it slightly), but it's a lot more pleasant to read than scrolling through a plaintext dump of the .mbox!
posted by Kadin2048 at 1:58 AM on December 5, 2021 [2 favorites]

